sched/fair: fix pick_eevdf to always find the correct se

Message ID xm261qego72d.fsf_-_@google.com
State New
Series: sched/fair: fix pick_eevdf to always find the correct se

Commit Message

Benjamin Segall Sept. 30, 2023, 12:09 a.m. UTC
  The old pick_eevdf could fail to find the actual earliest eligible
deadline when it descended to the right looking for min_deadline, but it
turned out that that min_deadline wasn't actually eligible. In that case
we need to go back and search through any left branches we skipped
looking for the actual best _eligible_ min_deadline.

This is more expensive, but still O(log n), and at worst should only
involve descending two branches of the rbtree.

I've run this through a userspace stress test (thank you
tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
corner cases.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Signed-off-by: Ben Segall <bsegall@google.com>
---
 kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 14 deletions(-)
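The two-pass search described in the commit message can be sketched in self-contained userspace C. Everything below is a simplified illustration, not the kernel code: `struct se`, `eligible()` and the bare binary tree stand in for `sched_entity`, `entity_eligible()` and the augmented rbtree, and eligibility is reduced to a plain vruntime threshold:

```c
#include <assert.h>
#include <stddef.h>

struct se {
	long vruntime;
	long deadline;
	long min_deadline;	/* min deadline over this subtree */
	struct se *left, *right;
};

/* stand-in for entity_eligible(): eligible iff not past the average */
static int eligible(const struct se *se, long avg)
{
	return se->vruntime <= avg;
}

/* recompute the augmented value bottom-up, as min_deadline_update() keeps it */
static void update_min(struct se *se)
{
	if (!se)
		return;
	update_min(se->left);
	update_min(se->right);
	se->min_deadline = se->deadline;
	if (se->left && se->left->min_deadline < se->min_deadline)
		se->min_deadline = se->left->min_deadline;
	if (se->right && se->right->min_deadline < se->min_deadline)
		se->min_deadline = se->right->min_deadline;
}

static struct se *pick(struct se *root, long avg)
{
	struct se *se = root, *best = NULL, *best_left = NULL;

	/* Pass 1: walk down the eligibility path. */
	while (se) {
		if (!eligible(se, avg)) {
			se = se->left;
			continue;
		}
		if (!best || se->deadline < best->deadline)
			best = se;
		if (se->left) {
			/* every node in a left branch is eligible */
			if (!best_left ||
			    se->left->min_deadline < best_left->min_deadline)
				best_left = se->left;
			/* min_deadline is in the left branch: go to pass 2 */
			if (se->left->min_deadline == se->min_deadline)
				break;
		}
		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline)
			break;
		se = se->right;
	}

	/* best_left set implies best set, so this mirrors the patch */
	if (!best_left || best_left->min_deadline > best->deadline)
		return best;

	/* Pass 2: plain EDF heap walk; everything under best_left is eligible. */
	se = best_left;
	while (se) {
		if (se->deadline == se->min_deadline)
			return se;
		if (se->left && se->left->min_deadline == se->min_deadline)
			se = se->left;
		else
			se = se->right;
	}
	return NULL;
}
```

A tree with the overall min_deadline hiding in an ineligible right subtree reproduces the failure mode the patch fixes: pass 1 chases the right branch in vain, and pass 2 recovers the best skipped left branch.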
  

Comments

Marek Szyprowski Oct. 4, 2023, 8:39 p.m. UTC | #1
Hi,

On 30.09.2023 02:09, Benjamin Segall wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
>
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
>
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
>
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>

This patch landed in today's linux-next as commit 561c58efd239 
("sched/fair: Fix pick_eevdf()"). Surprisingly, it introduced a warning 
about a circular locking dependency. It can easily be observed from time 
to time during boot on a qemu/arm64 'virt' machine:

======================================================
WARNING: possible circular locking dependency detected
6.6.0-rc4+ #7222 Not tainted
------------------------------------------------------
systemd-udevd/1187 is trying to acquire lock:
ffffbcc2be0c4de0 (console_owner){..-.}-{0:0}, at: console_flush_all+0x1b0/0x500

but task is already holding lock:
ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #4 (&rq->__lock){-.-.}-{2:2}:
        _raw_spin_lock_nested+0x44/0x5c
        raw_spin_rq_lock_nested+0x24/0x40
        task_fork_fair+0x3c/0xac
        sched_cgroup_fork+0xe8/0x14c
        copy_process+0x11c4/0x1a14
        kernel_clone+0x8c/0x400
        user_mode_thread+0x70/0x98
        rest_init+0x28/0x190
        arch_post_acpi_subsys_init+0x0/0x8
        start_kernel+0x594/0x684
        __primary_switched+0xbc/0xc4

-> #3 (&p->pi_lock){-.-.}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x88
        try_to_wake_up+0x58/0x468
        default_wake_function+0x14/0x20
        woken_wake_function+0x20/0x2c
        __wake_up_common+0x94/0x170
        __wake_up_common_lock+0x7c/0xcc
        __wake_up+0x18/0x24
        tty_wakeup+0x34/0x70
        tty_port_default_wakeup+0x20/0x38
        tty_port_tty_wakeup+0x18/0x24
        uart_write_wakeup+0x18/0x28
        pl011_tx_chars+0x140/0x2b4
        pl011_start_tx+0xe8/0x190
        serial_port_runtime_resume+0x90/0xc0
        __rpm_callback+0x48/0x1a8
        rpm_callback+0x6c/0x78
        rpm_resume+0x438/0x6d8
        pm_runtime_work+0x84/0xc8
        process_one_work+0x1ec/0x53c
        worker_thread+0x298/0x408
        kthread+0x124/0x128
        ret_from_fork+0x10/0x20

-> #2 (&tty->write_wait){....}-{2:2}:
        _raw_spin_lock_irqsave+0x60/0x88
        __wake_up_common_lock+0x5c/0xcc
        __wake_up+0x18/0x24
        tty_wakeup+0x34/0x70
        tty_port_default_wakeup+0x20/0x38
        tty_port_tty_wakeup+0x18/0x24
        uart_write_wakeup+0x18/0x28
        pl011_tx_chars+0x140/0x2b4
        pl011_start_tx+0xe8/0x190
        serial_port_runtime_resume+0x90/0xc0
        __rpm_callback+0x48/0x1a8
        rpm_callback+0x6c/0x78
        rpm_resume+0x438/0x6d8
        pm_runtime_work+0x84/0xc8
        process_one_work+0x1ec/0x53c
        worker_thread+0x298/0x408
        kthread+0x124/0x128
        ret_from_fork+0x10/0x20

-> #1 (&port_lock_key){..-.}-{2:2}:
        _raw_spin_lock+0x48/0x60
        pl011_console_write+0x13c/0x1b0
        console_flush_all+0x20c/0x500
        console_unlock+0x6c/0x130
        vprintk_emit+0x228/0x3a0
        vprintk_default+0x38/0x44
        vprintk+0xa4/0xc0
        _printk+0x5c/0x84
        register_console+0x1f4/0x420
        serial_core_register_port+0x5a4/0x5d8
        serial_ctrl_register_port+0x10/0x1c
        uart_add_one_port+0x10/0x1c
        pl011_register_port+0x70/0x12c
        pl011_probe+0x1bc/0x1fc
        amba_probe+0x110/0x1c8
        really_probe+0x148/0x2b4
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0xd8/0x160
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x84/0xe0
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x14/0x20
        bus_probe_device+0xb0/0xb4
        device_add+0x574/0x738
        amba_device_add+0x40/0xac
        of_platform_bus_create+0x2b4/0x378
        of_platform_populate+0x50/0xfc
        of_platform_default_populate_init+0xd0/0xf0
        do_one_initcall+0x74/0x2f0
        kernel_init_freeable+0x28c/0x4dc
        kernel_init+0x24/0x1dc
        ret_from_fork+0x10/0x20

-> #0 (console_owner){..-.}-{0:0}:
        __lock_acquire+0x1318/0x20c4
        lock_acquire+0x1e8/0x318
        console_flush_all+0x1f8/0x500
        console_unlock+0x6c/0x130
        vprintk_emit+0x228/0x3a0
        vprintk_default+0x38/0x44
        vprintk+0xa4/0xc0
        _printk+0x5c/0x84
        pick_next_task_fair+0x28c/0x498
        __schedule+0x164/0xc40
        do_task_dead+0x54/0x58
        do_exit+0x61c/0x9e8
        do_group_exit+0x34/0x90
        __wake_up_parent+0x0/0x30
        invoke_syscall+0x48/0x114
        el0_svc_common.constprop.0+0x40/0xe0
        do_el0_svc_compat+0x1c/0x38
        el0_svc_compat+0x48/0xb4
        el0t_32_sync_handler+0x90/0x140
        el0t_32_sync+0x194/0x198

other info that might help us debug this:

Chain exists of:
   console_owner --> &p->pi_lock --> &rq->__lock

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&rq->__lock);
                                lock(&p->pi_lock);
                                lock(&rq->__lock);
   lock(console_owner);

  *** DEADLOCK ***

3 locks held by systemd-udevd/1187:
  #0: ffff5535ffdd2b18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xe0/0xc40
  #1: ffffbcc2be0c4c30 (console_lock){+.+.}-{0:0}, at: vprintk_emit+0x11c/0x3a0
  #2: ffffbcc2be0c4c88 (console_srcu){....}-{0:0}, at: console_flush_all+0x7c/0x500

stack backtrace:
CPU: 1 PID: 1187 Comm: systemd-udevd Not tainted 6.6.0-rc4+ #7222
Hardware name: linux,dummy-virt (DT)
Call trace:
  dump_backtrace+0x98/0xf0
  show_stack+0x18/0x24
  dump_stack_lvl+0x60/0xac
  dump_stack+0x18/0x24
  print_circular_bug+0x290/0x370
  check_noncircular+0x15c/0x170
  __lock_acquire+0x1318/0x20c4
  lock_acquire+0x1e8/0x318
  console_flush_all+0x1f8/0x500
  console_unlock+0x6c/0x130
  vprintk_emit+0x228/0x3a0
  vprintk_default+0x38/0x44
  vprintk+0xa4/0xc0
  _printk+0x5c/0x84
  pick_next_task_fair+0x28c/0x498
  __schedule+0x164/0xc40
  do_task_dead+0x54/0x58
  do_exit+0x61c/0x9e8
  do_group_exit+0x34/0x90
  __wake_up_parent+0x0/0x30
  invoke_syscall+0x48/0x114
  el0_svc_common.constprop.0+0x40/0xe0
  do_el0_svc_compat+0x1c/0x38
  el0_svc_compat+0x48/0xb4
  el0t_32_sync_handler+0x90/0x140
  el0t_32_sync+0x194/0x198

The problem is probably elsewhere, but this scheduler change only 
revealed it in a fully reproducible way. Reverting $subject on top of 
linux-next hides the problem deep enough that I was not able to 
reproduce it. Let me know if there is anything I can do to help fix 
this issue.

> ---
>   kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 58 insertions(+), 14 deletions(-)
>
> ...

Best regards
  
Abel Wu Oct. 11, 2023, 12:12 p.m. UTC | #2
On 9/30/23 8:09 AM, Benjamin Segall Wrote:
> The old pick_eevdf could fail to find the actual earliest eligible
> deadline when it descended to the right looking for min_deadline, but it
> turned out that that min_deadline wasn't actually eligible. In that case
> we need to go back and search through any left branches we skipped
> looking for the actual best _eligible_ min_deadline.
> 
> This is more expensive, but still O(log n), and at worst should only
> involve descending two branches of the rbtree.
> 
> I've run this through a userspace stress test (thank you
> tools/lib/rbtree.c), so hopefully this implementation doesn't miss any
> corner cases.
> 
> Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
> Signed-off-by: Ben Segall <bsegall@google.com>
> ---
>   kernel/sched/fair.c | 72 ++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 58 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0c31cda0712f..77e9440b8ab3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -864,18 +864,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>    *
>    *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
>    *
>    * Which allows an EDF like search on (sub)trees.
>    */
> -static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>   {
>   	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
>   	struct sched_entity *curr = cfs_rq->curr;
>   	struct sched_entity *best = NULL;
> +	struct sched_entity *best_left = NULL;
>   
>   	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>   		curr = NULL;
> +	best = curr;
>   
>   	/*
>   	 * Once selected, run a task until it either becomes non-eligible or
>   	 * until it gets a new slice. See the HACK in set_next_entity().
>   	 */
> @@ -892,45 +894,87 @@ static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>   			node = node->rb_left;
>   			continue;
>   		}
>   
>   		/*
> -		 * If this entity has an earlier deadline than the previous
> -		 * best, take this one. If it also has the earliest deadline
> -		 * of its subtree, we're done.
> +		 * Now we heap search eligible trees for the best (min_)deadline
>   		 */
> -		if (!best || deadline_gt(deadline, best, se)) {
> +		if (!best || deadline_gt(deadline, best, se))
>   			best = se;
> -			if (best->deadline == best->min_deadline)
> -				break;
> -		}
>   
>   		/*
> -		 * If the earlest deadline in this subtree is in the fully
> -		 * eligible left half of our space, go there.
> +		 * Every se in a left branch is eligible, keep track of the
> +		 * branch with the best min_deadline
>   		 */
> +		if (node->rb_left) {
> +			struct sched_entity *left = __node_2_se(node->rb_left);
> +
> +			if (!best_left || deadline_gt(min_deadline, best_left, left))
> +				best_left = left;
> +
> +			/*
> +			 * min_deadline is in the left branch. rb_left and all
> +			 * descendants are eligible, so immediately switch to the second
> +			 * loop.
> +			 */
> +			if (left->min_deadline == se->min_deadline)
> +				break;
> +		}
> +
> +		/* min_deadline is at this node, no need to look right */
> +		if (se->deadline == se->min_deadline)
> +			break;
> +
> +		/* else min_deadline is in the right branch. */
> +		node = node->rb_right;
> +	}
> +
> +	/*
> +	 * We ran into an eligible node which is itself the best.
> +	 * (Or nr_running == 0 and both are NULL)
> +	 */
> +	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
> +		return best;
> +
> +	/*
> +	 * Now best_left and all of its children are eligible, and we are just
> +	 * looking for deadline == min_deadline
> +	 */
> +	node = &best_left->run_node;
> +	while (node) {
> +		struct sched_entity *se = __node_2_se(node);
> +
> +		/* min_deadline is the current node */
> +		if (se->deadline == se->min_deadline)
> +			return se;

IMHO it would be better tiebreak on vruntime by moving this hunk to ..

> +
> +		/* min_deadline is in the left branch */
>   		if (node->rb_left &&
>   		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>   			node = node->rb_left;
>   			continue;
>   		}

.. here, thoughts?

>   
> +		/* else min_deadline is in the right branch */
>   		node = node->rb_right;
>   	}
> +	return NULL;

Why not 'best'? Since ..

> +}
>   
> -	if (!best || (curr && deadline_gt(deadline, best, curr)))
> -		best = curr;
> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
> +{
> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>   
> -	if (unlikely(!best)) {
> +	if (!se) {
>   		struct sched_entity *left = __pick_first_entity(cfs_rq);

.. cfs_rq->curr isn't considered here.

>   		if (left) {
>   			pr_err("EEVDF scheduling fail, picking leftmost\n");
>   			return left;
>   		}
>   	}
>   
> -	return best;
> +	return se;
>   }
>   
>   #ifdef CONFIG_SCHED_DEBUG
>   struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>   {

The implementation of __pick_eevdf() is now quite complicated, which
makes it really hard to maintain. I'm trying my best to refactor this
part; hopefully it can be of some help. Below is only for illustration; I
need to test more.

static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *curr = cfs_rq->curr;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
		curr = NULL;

	/*
	 * Once selected, run a task until it either becomes non-eligible or
	 * until it gets a new slice. See the HACK in set_next_entity().
	 */
	if (sched_feat(RUN_TO_PARITY) && curr && curr->vlag == curr->deadline)
		return curr;

	while (node) {
		struct sched_entity *se = __node_2_se(node);

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			best = se;
			break;
		}

		node = node->rb_right;

		if (!node && cand) {
			node = cand;
			all_eligible = true;
			cand = NULL;
		}
	}

	if (!best || (curr && deadline_gt(deadline, best, curr)))
		best = curr;

	return best;
}
  
Peter Zijlstra Oct. 11, 2023, 1:14 p.m. UTC | #3
On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:

> The implementation of __pick_eevdf() is now quite complicated, which
> makes it really hard to maintain. I'm trying my best to refactor this
> part; hopefully it can be of some help. Below is only for illustration; I
> need to test more.

Well, the form it has now is somewhat close to the one in the paper. I
think Ben added one early break it doesn't have or something.

As the paper explains, you get two walks, one down the eligibility path,
and then one down the heap. I think the current code structure
represents that fairly well.

Very vague memories seem to suggest I thought to be clever some 10+
years ago when I wrote that other descent loop that Ben showed to be
broken (and woe on me for not verifying it when I resurrected the
patches; I rewrote pretty much everything else, except this lookup :/).
  
Benjamin Segall Oct. 11, 2023, 9:01 p.m. UTC | #4
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>> +	/*
>> +	 * Now best_left and all of its children are eligible, and we are just
>> +	 * looking for deadline == min_deadline
>> +	 */
>> +	node = &best_left->run_node;
>> +	while (node) {
>> +		struct sched_entity *se = __node_2_se(node);
>> +
>> +		/* min_deadline is the current node */
>> +		if (se->deadline == se->min_deadline)
>> +			return se;
>
> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>
>> +
>> +		/* min_deadline is in the left branch */
>>   		if (node->rb_left &&
>>   		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>   			node = node->rb_left;
>>   			continue;
>>   		}
>
> .. here, thoughts?

Yeah, that should work and be better on the tiebreak (and my test code
agrees). There's an argument that the tiebreak will never really come up
and it's better to avoid the potential one extra cache line from
"__node_2_se(node->rb_left)->min_deadline" though.

>
>>   +		/* else min_deadline is in the right branch */
>>   		node = node->rb_right;
>>   	}
>> +	return NULL;
>
> Why not 'best'? Since ..

The only time this can happen is if the tree is corrupt. We only reach
this case if best_left is set at all (and best_left's min_deadline is
better than "best"'s, which includes curr). In that case getting an
error message is good, and in general I wasn't worrying about it much.

>
>> +}
>>   -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>> -		best = curr;
>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>> +{
>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>   -	if (unlikely(!best)) {
>> +	if (!se) {
>>   		struct sched_entity *left = __pick_first_entity(cfs_rq);
>
> .. cfs_rq->curr isn't considered here.

That said, we should probably consider curr here in the error-case
fallback, if just as a "if (!left) left = cfs_rq->curr;"

>
>>   		if (left) {
>>   			pr_err("EEVDF scheduling fail, picking leftmost\n");
>>   			return left;
>>   		}
>>   	}
>>   -	return best;
>> +	return se;
>>   }
>>     #ifdef CONFIG_SCHED_DEBUG
>>   struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>>   {
>
> The implementation of __pick_eevdf() is now quite complicated, which
> makes it really hard to maintain. I'm trying my best to refactor this
> part; hopefully it can be of some help. Below is only for illustration; I
> need to test more.
>

A passing version of that single-loop code minus the cfs_rq->curr
handling is here (just pasting the curr handling back on the start/end
should do):

static struct sched_entity *pick_eevdf_abel(struct cfs_rq *cfs_rq)
{
	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
	struct sched_entity *best = NULL;
	struct sched_entity *cand = NULL;
	bool all_eligible = false;

	while (node || cand) {
		struct sched_entity *se = __node_2_se(node);
		if (!node) {
			BUG_ON(!cand);
			node = &cand->run_node;
			se = __node_2_se(node);
			all_eligible = true;
			cand = NULL;

			/*
			 * Our initial pass ran into an eligible node which is
			 * itself the best
			 */
			if (best && (s64)(se->min_deadline - best->deadline) > 0)
				break;
		}

		/*
		 * If this entity is not eligible, try the left subtree.
		 */
		if (!all_eligible && !entity_eligible(cfs_rq, se)) {
			node = node->rb_left;
			continue;
		}

		if (node->rb_left) {
			struct sched_entity *left = __node_2_se(node->rb_left);

			BUG_ON(left->min_deadline < se->min_deadline);

			/* Tiebreak on vruntime */
			if (left->min_deadline == se->min_deadline) {
				node = node->rb_left;
				all_eligible = true;
				continue;
			}

			if (!all_eligible) {
				/*
				 * We're going to search right subtree and the one
				 * with min_deadline can be non-eligible, so record
				 * the left subtree as a candidate.
				 */
				if (!cand || deadline_gt(min_deadline, cand, left))
					cand = left;
			}
		}

		if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
			best = se;

		/* min_deadline is at this node, no need to look right */
		if (se->deadline == se->min_deadline) {
			if (all_eligible && (!best || deadline_gte(deadline, best, se)))
				best = se;
			if (!all_eligible && (!best || deadline_gt(deadline, best, se)))
				best = se;
			node = NULL;
			continue;
		}

		node = node->rb_right;
	}

	return best;
}

This does exactly as well on the tiebreak condition of vruntime as the
improved two-loop version you suggested. If we don't care about the
tiebreak we can replace the ugly if(all_eligible) if(!all_eligible) in
the last section with a single version that just uses deadline_gt (or
gte) throughout.

Personally with all the extra bits I added for correctness this winds up
definitely uglier than the two-loop version, but it might be possible to
clean it up further. (And a bunch of it is just personal style
preference in the first place)

I've also attached my ugly userspace EEVDF tester as an attachment,
hopefully I attached it in a correct mode to go through lkml.
  
Abel Wu Oct. 12, 2023, 10:04 a.m. UTC | #5
On 10/11/23 9:14 PM, Peter Zijlstra Wrote:
> On Wed, Oct 11, 2023 at 08:12:09PM +0800, Abel Wu wrote:
> 
> As the paper explains, you get two walks, one down the eligibility path,
> and then one down the heap. I think the current code structure
> represents that fairly well.

Yes, it does. I just wonder if the 2-step search is necessary, since
both steps obey the same rule of heap search:

   1) node->min_deadline > node->left->min_deadline
	1.1) BUG

   2) node->min_deadline = node->left->min_deadline
	2.1) go left if tiebreak on vruntime

   3) node->min_deadline < node->left->min_deadline
	3.1) return @node if it has min deadline, or
	3.2) go right

which gives:

	while (node) {
		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				continue;
			}
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
	}

The above returns the entity with the earliest deadline (and with the
smallest vruntime if deadlines are equal). Then comes eligibility:

   0) it helps pruning the tree since the right subtree of a
      non-eligible node can't contain any eligible node.

   3.2.1) record left as a fallback iff the eligibility check
          is active; saving the best one is enough, since
          none of them contains a non-eligible node, IOW the
          one with min deadline in the left subtree must be
          eligible.

   4) the eligibility check ends immediately once we go left
      from an eligible node, including switching to the fallback,
      which is essentially the 'left' of an eligible node.

   5) fall back to the candidate (if it exists) if we failed
      to find an eligible entity with the earliest deadline.

which makes:

	candidate = NULL;
	need_check = true;

	while (node) {
		/* 0 */
		if (need_check && !eligible(node)) {
			node = node->left;
			goto next;
		}

		if ((left = node->left)) {
			/* 1.1 */
			BUG_ON(left->min < node->min);

			/* 2.1 */
			if (left->min == node->min) {
				node = left;
				/* 4 */
				need_check = false;
				continue;
			}

			/* 3.2.1 */
			if (need_check)
				candidate = better(candidate, left);
		}

		/* 3.1 */
		if (node->deadline == node->min)
			return node;

		/* 3.2 */
		node = node->right;
	next:
		/* 5 */
		if (!node && candidate) {
			node = candidate;
			need_check = false; /* 4 */
			candidate = NULL;
		}
	}

The search ends with a 'best' entity in the tree; comparing it with
curr, which is outside the tree, completes the picture.

But it's absolutely fine to me to honor the 2-step search given by
the paper if you think that is already clear enough :)

Best,
	Abel
  
Abel Wu Oct. 12, 2023, 10:25 a.m. UTC | #6
On 10/12/23 5:01 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
> 
>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>> +	/*
>>> +	 * Now best_left and all of its children are eligible, and we are just
>>> +	 * looking for deadline == min_deadline
>>> +	 */
>>> +	node = &best_left->run_node;
>>> +	while (node) {
>>> +		struct sched_entity *se = __node_2_se(node);
>>> +
>>> +		/* min_deadline is the current node */
>>> +		if (se->deadline == se->min_deadline)
>>> +			return se;
>>
>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>
>>> +
>>> +		/* min_deadline is in the left branch */
>>>    		if (node->rb_left &&
>>>    		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>    			node = node->rb_left;
>>>    			continue;
>>>    		}
>>
>> .. here, thoughts?
> 
> Yeah, that should work and be better on the tiebreak (and my test code
> agrees). There's an argument that the tiebreak will never really come up
> and it's better to avoid the potential one extra cache line from
> "__node_2_se(node->rb_left)->min_deadline" though.

I see. Then probably do the same thing in the first loop?

> 
>>
>>>    +		/* else min_deadline is in the right branch */
>>>    		node = node->rb_right;
>>>    	}
>>> +	return NULL;
>>
>> Why not 'best'? Since ..
> 
> The only time this can happen is if the tree is corrupt. We only reach
> this case if best_left is set at all (and best_left's min_deadline is
> better than "best"'s, which includes curr). In that case getting an
> error message is good, and in general I wasn't worrying about it much.

Right.

> 
>>
>>> +}
>>>    -	if (!best || (curr && deadline_gt(deadline, best, curr)))
>>> -		best = curr;
>>> +static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
>>> +{
>>> +	struct sched_entity *se = __pick_eevdf(cfs_rq);
>>>    -	if (unlikely(!best)) {
>>> +	if (!se) {
>>>    		struct sched_entity *left = __pick_first_entity(cfs_rq);
>>
>> .. cfs_rq->curr isn't considered here.
> 
> That said, we should probably consider curr here in the error-case
> fallback, if just as a "if (!left) left = cfs_rq->curr;"

I don't think so; since there must be some bug in the scheduler,
replacing 'pr_err' with 'BUG()' would be more appropriate.

> 
> 
> I've also attached my ugly userspace EEVDF tester as an attachment,
> hopefully I attached it in a correct mode to go through lkml.

Received. Thanks, Ben.
  
Benjamin Segall Oct. 12, 2023, 5:51 p.m. UTC | #7
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>> 
>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>> +	/*
>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>> +	 * looking for deadline == min_deadline
>>>> +	 */
>>>> +	node = &best_left->run_node;
>>>> +	while (node) {
>>>> +		struct sched_entity *se = __node_2_se(node);
>>>> +
>>>> +		/* min_deadline is the current node */
>>>> +		if (se->deadline == se->min_deadline)
>>>> +			return se;
>>>
>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>
>>>> +
>>>> +		/* min_deadline is in the left branch */
>>>>    		if (node->rb_left &&
>>>>    		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>    			node = node->rb_left;
>>>>    			continue;
>>>>    		}
>>>
>>> .. here, thoughts?
>> Yeah, that should work and be better on the tiebreak (and my test code
>> agrees). There's an argument that the tiebreak will never really come up
>> and it's better to avoid the potential one extra cache line from
>> "__node_2_se(node->rb_left)->min_deadline" though.
>
> I see. Then probably do the same thing in the first loop?
>

We effectively do that already sorta by accident almost always -
computing best and best_left via deadline_gt rather than gte prioritizes
earlier elements, which always have a better vruntime.

Then when we do the best_left->min_deadline vs best->deadline
computation, we prioritize best_left, which is the one case it can be
wrong, we'd need an additional
"if (se->min_deadline == best->deadline &&
(s64)(se->vruntime - best->vruntime) > 0) return best;" check at the end
of the second loop.

(Though again I don't know how much this sort of never-going-to-happen
slight fairness improvement is worth compared to the extra bit of
overhead)
  
Abel Wu Oct. 13, 2023, 3:46 a.m. UTC | #8
On 10/13/23 1:51 AM, Benjamin Segall Wrote:
> Abel Wu <wuyun.abel@bytedance.com> writes:
> 
>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>
>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>> +	/*
>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>> +	 * looking for deadline == min_deadline
>>>>> +	 */
>>>>> +	node = &best_left->run_node;
>>>>> +	while (node) {
>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>> +
>>>>> +		/* min_deadline is the current node */
>>>>> +		if (se->deadline == se->min_deadline)
>>>>> +			return se;
>>>>
>>>> IMHO it would be better tiebreak on vruntime by moving this hunk to ..
>>>>
>>>>> +
>>>>> +		/* min_deadline is in the left branch */
>>>>>     		if (node->rb_left &&
>>>>>     		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>     			node = node->rb_left;
>>>>>     			continue;
>>>>>     		}
>>>>
>>>> .. here, thoughts?
>>> Yeah, that should work and be better on the tiebreak (and my test code
>>> agrees). There's an argument that the tiebreak will never really come up
>>> and it's better to avoid the potential one extra cache line from
>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>
>> I see. Then probably do the same thing in the first loop?
>>
> 
> We effectively do that already sorta by accident almost always -
> computing best and best_left via deadline_gt rather than gte prioritizes
> earlier elements, which always have a better vruntime.

Sorry for not being clear about the 'same thing'. What I meant
was to avoid touching the left subtree if the node itself has the min
deadline.

@@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
                 if (!best || deadline_gt(deadline, best, se))
                         best = se;

+               if (se->deadline == se->min_deadline)
+                       break;
+
                 /*
                  * Every se in a left branch is eligible, keep track of the
                  * branch with the best min_deadline
@@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
                                 break;
                 }

-               /* min_deadline is at this node, no need to look right */
-               if (se->deadline == se->min_deadline)
-                       break;
-
                 /* else min_deadline is in the right branch. */
                 node = node->rb_right;
         }

(But still thanks for the convincing explanation on fairness.)

Best,
	Abel

> 
> Then when we do the best_left->min_deadline vs best->deadline
> computation, we prioritize best_left, which is the one case it can be
> wrong, we'd need an additional
> "if (se->min_deadline == best->deadline &&
> (s64)(se->vruntime - best->vruntime) > 0) return best;" check at the end
> of the second loop.
> 
> (Though again I don't know how much this sort of never-going-to-happen
> slight fairness improvement is worth compared to the extra bit of
> overhead)
  
Benjamin Segall Oct. 13, 2023, 4:51 p.m. UTC | #9
Abel Wu <wuyun.abel@bytedance.com> writes:

> On 10/13/23 1:51 AM, Benjamin Segall Wrote:
>> Abel Wu <wuyun.abel@bytedance.com> writes:
>> 
>>> On 10/12/23 5:01 AM, Benjamin Segall Wrote:
>>>> Abel Wu <wuyun.abel@bytedance.com> writes:
>>>>
>>>>> On 9/30/23 8:09 AM, Benjamin Segall Wrote:
>>>>>> +	/*
>>>>>> +	 * Now best_left and all of its children are eligible, and we are just
>>>>>> +	 * looking for deadline == min_deadline
>>>>>> +	 */
>>>>>> +	node = &best_left->run_node;
>>>>>> +	while (node) {
>>>>>> +		struct sched_entity *se = __node_2_se(node);
>>>>>> +
>>>>>> +		/* min_deadline is the current node */
>>>>>> +		if (se->deadline == se->min_deadline)
>>>>>> +			return se;
>>>>>
>>>>> IMHO it would be better to tiebreak on vruntime by moving this hunk to ..
>>>>>
>>>>>> +
>>>>>> +		/* min_deadline is in the left branch */
>>>>>>     		if (node->rb_left &&
>>>>>>     		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
>>>>>>     			node = node->rb_left;
>>>>>>     			continue;
>>>>>>     		}
>>>>>
>>>>> .. here, thoughts?
>>>> Yeah, that should work and be better on the tiebreak (and my test code
>>>> agrees). There's an argument that the tiebreak will never really come up
>>>> and it's better to avoid the potential one extra cache line from
>>>> "__node_2_se(node->rb_left)->min_deadline" though.
>>>
>>> I see. Then probably do the same thing in the first loop?
>>>
>> We effectively do that already sorta by accident almost always -
>> computing best and best_left via deadline_gt rather than gte prioritizes
>> earlier elements, which always have a better vruntime.
>
> Sorry for not clarifying the 'same thing' clearly. What I meant was to
> avoid touching the left branch if the node itself has the min deadline.
>
> @@ -894,6 +894,9 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>                 if (!best || deadline_gt(deadline, best, se))
>                         best = se;
>
> +               if (se->deadline == se->min_deadline)
> +                       break;
> +
>                 /*
>                  * Every se in a left branch is eligible, keep track of the
>                  * branch with the best min_deadline
> @@ -913,10 +916,6 @@ static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
>                                 break;
>                 }
>
> -               /* min_deadline is at this node, no need to look right */
> -               if (se->deadline == se->min_deadline)
> -                       break;
> -
>                 /* else min_deadline is in the right branch. */
>                 node = node->rb_right;
>         }
>
> (But still thanks for the convincing explanation on fairness.)
>

Ah, yes, in terms of optimizing performance rather than marginal
fairness, that would help.
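[Editor's sketch] For readers following along outside the kernel tree, the invariant the patch relies on, se->min_deadline = min(se->deadline, se->{left,right}->min_deadline), and the patch's second loop can be modeled in plain C. Everything below is an illustrative userspace model (struct `node`, `update_min`, `pick_min_deadline` are invented names); the kernel maintains min_deadline incrementally through augmented-rbtree callbacks rather than the full recomputation used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	uint64_t deadline;
	uint64_t min_deadline;
	struct node *left, *right;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Recompute the min_deadline invariant bottom-up over a (sub)tree. */
static void update_min(struct node *n)
{
	if (!n)
		return;
	update_min(n->left);
	update_min(n->right);
	n->min_deadline = n->deadline;
	if (n->left)
		n->min_deadline = min_u64(n->min_deadline,
					  n->left->min_deadline);
	if (n->right)
		n->min_deadline = min_u64(n->min_deadline,
					  n->right->min_deadline);
}

/*
 * Mirror of the patch's second loop: every node under n is already
 * known eligible, so the answer is the node whose own deadline equals
 * the subtree's min_deadline, found by one more O(log n) descent.
 */
static struct node *pick_min_deadline(struct node *n)
{
	while (n) {
		if (n->deadline == n->min_deadline)
			return n;
		if (n->left && n->left->min_deadline == n->min_deadline) {
			n = n->left;
			continue;
		}
		n = n->right;
	}
	return NULL;
}
```

The patch's first loop does the extra work this model skips: it walks only the eligible part of the tree, tracking `best` and the best left subtree `best_left`, and the descent above then runs from `best_left` when its min_deadline beats `best->deadline`.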
  

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c31cda0712f..77e9440b8ab3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -864,18 +864,20 @@  struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
  *
  *  se->min_deadline = min(se->deadline, se->{left,right}->min_deadline)
  *
  * Which allows an EDF like search on (sub)trees.
  */
-static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+static struct sched_entity *__pick_eevdf(struct cfs_rq *cfs_rq)
 {
 	struct rb_node *node = cfs_rq->tasks_timeline.rb_root.rb_node;
 	struct sched_entity *curr = cfs_rq->curr;
 	struct sched_entity *best = NULL;
+	struct sched_entity *best_left = NULL;
 
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
+	best = curr;
 
 	/*
 	 * Once selected, run a task until it either becomes non-eligible or
 	 * until it gets a new slice. See the HACK in set_next_entity().
 	 */
@@ -892,45 +894,87 @@  static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
 			node = node->rb_left;
 			continue;
 		}
 
 		/*
-		 * If this entity has an earlier deadline than the previous
-		 * best, take this one. If it also has the earliest deadline
-		 * of its subtree, we're done.
+		 * Now we heap search eligible trees for the best (min_)deadline
 		 */
-		if (!best || deadline_gt(deadline, best, se)) {
+		if (!best || deadline_gt(deadline, best, se))
 			best = se;
-			if (best->deadline == best->min_deadline)
-				break;
-		}
 
 		/*
-		 * If the earlest deadline in this subtree is in the fully
-		 * eligible left half of our space, go there.
+		 * Every se in a left branch is eligible, keep track of the
+		 * branch with the best min_deadline
 		 */
+		if (node->rb_left) {
+			struct sched_entity *left = __node_2_se(node->rb_left);
+
+			if (!best_left || deadline_gt(min_deadline, best_left, left))
+				best_left = left;
+
+			/*
+			 * min_deadline is in the left branch. rb_left and all
+			 * descendants are eligible, so immediately switch to the second
+			 * loop.
+			 */
+			if (left->min_deadline == se->min_deadline)
+				break;
+		}
+
+		/* min_deadline is at this node, no need to look right */
+		if (se->deadline == se->min_deadline)
+			break;
+
+		/* else min_deadline is in the right branch. */
+		node = node->rb_right;
+	}
+
+	/*
+	 * We ran into an eligible node which is itself the best.
+	 * (Or nr_running == 0 and both are NULL)
+	 */
+	if (!best_left || (s64)(best_left->min_deadline - best->deadline) > 0)
+		return best;
+
+	/*
+	 * Now best_left and all of its children are eligible, and we are just
+	 * looking for deadline == min_deadline
+	 */
+	node = &best_left->run_node;
+	while (node) {
+		struct sched_entity *se = __node_2_se(node);
+
+		/* min_deadline is the current node */
+		if (se->deadline == se->min_deadline)
+			return se;
+
+		/* min_deadline is in the left branch */
 		if (node->rb_left &&
 		    __node_2_se(node->rb_left)->min_deadline == se->min_deadline) {
 			node = node->rb_left;
 			continue;
 		}
 
+		/* else min_deadline is in the right branch */
 		node = node->rb_right;
 	}
+	return NULL;
+}
 
-	if (!best || (curr && deadline_gt(deadline, best, curr)))
-		best = curr;
+static struct sched_entity *pick_eevdf(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *se = __pick_eevdf(cfs_rq);
 
-	if (unlikely(!best)) {
+	if (!se) {
 		struct sched_entity *left = __pick_first_entity(cfs_rq);
 		if (left) {
 			pr_err("EEVDF scheduling fail, picking leftmost\n");
 			return left;
 		}
 	}
 
-	return best;
+	return se;
 }
 
 #ifdef CONFIG_SCHED_DEBUG
 struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {