[3/4] sched/fair: introduce core_vruntime and core_min_vruntime

Message ID 20231115113341.13261-4-CruzZhao@linux.alibaba.com
State New
Series sched/core: fix cfs_prio_less

Commit Message

cruzzhao Nov. 15, 2023, 11:33 a.m. UTC
  To compare the priorities of sched_entities from different CPUs of a core,
we introduce core_vruntime in struct sched_entity and core_min_vruntime
in struct cfs_rq.

cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
of the same task_group among the core, and se->core_vruntime is the
vruntime relative to se->cfs_rq->core->core_min_vruntime.

Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 52 ++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  1 +
 3 files changed, 51 insertions(+), 5 deletions(-)
  

Comments

Peter Zijlstra Nov. 15, 2023, 12:20 p.m. UTC | #1
On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
> To compare the priority of sched_entity from different cpus of a core,
> we introduce core_vruntime to struct sched_entity and core_min_vruntime
> to struct cfs_rq.
> 
> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
> of the same task_group among the core, and se->core_vruntime is the
> vruntime relative to se->cfs_rq->core->core_min_vruntime.

But that makes absolutely no sense. vruntime of different RQs can
advance at wildly different rates. Not to mention there's this random
offset between them.

No, this cannot be.
  
cruzzhao Nov. 15, 2023, 1:42 p.m. UTC | #2
On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
>> To compare the priority of sched_entity from different cpus of a core,
>> we introduce core_vruntime to struct sched_entity and core_min_vruntime
>> to struct cfs_rq.
>>
>> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
>> of the same task_group among the core, and se->core_vruntime is the
>> vruntime relative to se->cfs_rq->core->core_min_vruntime.
> 
> But that makes absolutely no sense. vruntime of different RQs can
> advance at wildly different rates. Not to mention there's this random
> offset between them.
> 
> No, this cannot be.

The force-idle vruntime snapshot does the same thing: it compares
sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
cfs_rqb->min_vruntime_fi, while sea and seb may advance at wildly
different rates.

Actually, cfs_rq->core->core_min_vruntime does the same thing as
cfs_rq->min_vruntime_fi, providing a baseline, but
cfs_rq->core->core_min_vruntime is more accurate.

I've tried to implement a fair enough mechanism of core_vruntime, but
it's too complex because of the weight, and it costs a lot. So this is a
compromise solution.

BTW, are there any other solutions to this problem?

Best,
Cruz Zhao
  
Peter Zijlstra Nov. 15, 2023, 3:22 p.m. UTC | #3
On Wed, Nov 15, 2023 at 09:42:13PM +0800, cruzzhao wrote:
> 
> 
> On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
> > On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
> >> To compare the priority of sched_entity from different cpus of a core,
> >> we introduce core_vruntime to struct sched_entity and core_min_vruntime
> >> to struct cfs_rq.
> >>
> >> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
> >> of the same task_group among the core, and se->core_vruntime is the
> >> vruntime relative to se->cfs_rq->core->core_min_vruntime.
> > 
> > But that makes absolutely no sense. vruntime of different RQs can
> > advance at wildly different rates. Not to mention there's this random
> > offset between them.
> > 
> > No, this cannot be.
> 
> Force idle vruntime snapshot does the same thing, comparing
> sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
> cfs_rqb->min_vruntime_fi, while sea and seb may have wildly different rates.

But that subtracts a from a and b from b, it doesn't mix a and b.

Note that se->vruntime - cfs_rq->min_vruntime is a very poor
approximation of lag. We have actual lag now.

Note that:

  (sea - seb) + (min_fib - min_fia) =
  (sea - min_fia) + (min_fib - seb) =
  (sea - min_fia) - (seb - min_fib) =
  'lag'a - 'lag'b

It doesn't mix absolute a and b terms anywhere.

> Actually, cfs_rq->core->core_min_vruntime does the same thing as
> cfs_rq->min_vruntime_fi, providing a baseline, but
> cfs_rq->core->core_min_vruntime is more accurate.

min(cfs_rqa, cfs_rqb) is nonsense. And I can't see how min_vruntime_fi
would do anything like that.

> I've tried to implement a fair enough mechanism of core_vruntime, but
> it's too complex because of the weight, and it costs a lot. So this is a
> compromise solution.

'this' is complete nonsense and not motivated by any math.

> BTW, is there any other solutions to solve this problem?

Well, this is where it all started:

  https://lkml.kernel.org/r/20200506143506.GH5298%40hirez.programming.kicks-ass.net

The above lag thing is detailed in a follow up:

  https://lkml.kernel.org/r/20200515103844.GG2978%40hirez.programming.kicks-ass.net

Anyway, I think the first of those links has the start of the
multi-queue formalism, see the S_k+l bits. Work that out and see where
it ends.

I did go a bit further, but I've forgotten everything since, it's been
years.

Anyway, nothing like this goes in without a fairly solid bit of math in
the changelog to justify it.

Also, I think Joel complained about something like this at some point;
he wanted to update the core tree more often, because IIRC his
observation was that things got stale.
  
kernel test robot Nov. 15, 2023, 8:51 p.m. UTC | #4
Hi Cruz,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.7-rc1 next-20231115]
[cannot apply to tip/sched/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Cruz-Zhao/sched-core-introduce-core_id-to-struct-rq/20231115-193559
base:   linus/master
patch link:    https://lore.kernel.org/r/20231115113341.13261-4-CruzZhao%40linux.alibaba.com
patch subject: [PATCH 3/4] sched/fair: introduce core_vruntime and core_min_vruntime
config: i386-buildonly-randconfig-005-20231116 (https://download.01.org/0day-ci/archive/20231116/202311160442.IBG98IUu-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231116/202311160442.IBG98IUu-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311160442.IBG98IUu-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from kernel/sched/fair.c:54:
   kernel/sched/fair.c: In function 'update_min_vruntime':
>> kernel/sched/fair.c:802:37: error: 'struct cfs_rq' has no member named 'core_min_vruntime_copy'; did you mean 'core_min_vruntime'?
     802 |         u64_u32_store(cfs_rq->core->core_min_vruntime,
         |                                     ^~~~~~~~~~~~~~~~~
   kernel/sched/sched.h:528:9: note: in definition of macro 'u64_u32_store_copy'
     528 |         copy = __val;                                                   \
         |         ^~~~
   kernel/sched/fair.c:802:9: note: in expansion of macro 'u64_u32_store'
     802 |         u64_u32_store(cfs_rq->core->core_min_vruntime,
         |         ^~~~~~~~~~~~~
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:12462:6: warning: no previous prototype for 'sched_core_init_cfs_rq' [-Wmissing-prototypes]
   12462 | void sched_core_init_cfs_rq(struct task_group *tg, struct cfs_rq *cfs_rq)
         |      ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c: In function 'init_cfs_rq':
   kernel/sched/fair.c:12698:31: error: 'struct cfs_rq' has no member named 'core_min_vruntime_copy'; did you mean 'core_min_vruntime'?
   12698 |         u64_u32_store(cfs_rq->core_min_vruntime, (u64)(-(1LL << 20)));
         |                               ^~~~~~~~~~~~~~~~~
   kernel/sched/sched.h:528:9: note: in definition of macro 'u64_u32_store_copy'
     528 |         copy = __val;                                                   \
         |         ^~~~
   kernel/sched/fair.c:12698:9: note: in expansion of macro 'u64_u32_store'
   12698 |         u64_u32_store(cfs_rq->core_min_vruntime, (u64)(-(1LL << 20)));
         |         ^~~~~~~~~~~~~
   kernel/sched/fair.c: At top level:
   kernel/sched/fair.c:12985:6: warning: no previous prototype for 'free_fair_sched_group' [-Wmissing-prototypes]
   12985 | void free_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:12987:5: warning: no previous prototype for 'alloc_fair_sched_group' [-Wmissing-prototypes]
   12987 | int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
         |     ^~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:12992:6: warning: no previous prototype for 'online_fair_sched_group' [-Wmissing-prototypes]
   12992 | void online_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~
   kernel/sched/fair.c:12994:6: warning: no previous prototype for 'unregister_fair_sched_group' [-Wmissing-prototypes]
   12994 | void unregister_fair_sched_group(struct task_group *tg) { }
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~~


vim +802 kernel/sched/fair.c

   772	
   773		if (curr) {
   774			if (curr->on_rq) {
   775				vruntime = curr->vruntime;
   776	#ifdef CONFIG_SCHED_CORE
   777				core_vruntime = curr->core_vruntime;
   778	#endif
   779			} else {
   780				curr = NULL;
   781			}
   782		}
   783	
   784		if (se) {
   785			if (!curr) {
   786				vruntime = se->vruntime;
   787	#ifdef CONFIG_SCHED_CORE
   788				core_vruntime = se->core_vruntime;
   789	#endif
   790			} else {
   791				vruntime = min_vruntime(vruntime, se->vruntime);
   792	#ifdef CONFIG_SCHED_CORE
   793				core_vruntime = min_vruntime(core_vruntime, se->core_vruntime);
   794	#endif
   795			}
   796		}
   797	
   798		/* ensure we never gain time by being placed backwards. */
   799		u64_u32_store(cfs_rq->min_vruntime,
   800			      __update_min_vruntime(cfs_rq, vruntime));
   801	#ifdef CONFIG_SCHED_CORE
 > 802		u64_u32_store(cfs_rq->core->core_min_vruntime,
   803			     __update_core_min_vruntime(cfs_rq->core, vruntime));
   804	#endif
   805	}
   806
  
cruzzhao Nov. 16, 2023, 6:38 a.m. UTC | #5
On 2023/11/15 11:22 PM, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 09:42:13PM +0800, cruzzhao wrote:
>>
>>
>> On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
>>> On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
>>>> To compare the priority of sched_entity from different cpus of a core,
>>>> we introduce core_vruntime to struct sched_entity and core_min_vruntime
>>>> to struct cfs_rq.
>>>>
>>>> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
>>>> of the same task_group among the core, and se->core_vruntime is the
>>>> vruntime relative to se->cfs_rq->core->core_min_vruntime.
>>>
>>> But that makes absolutely no sense. vruntime of different RQs can
>>> advance at wildly different rates. Not to mention there's this random
>>> offset between them.
>>>
>>> No, this cannot be.
>>
>> Force idle vruntime snapshot does the same thing, comparing
>> sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
>> cfs_rqb->min_vruntime_fi, while sea and seb may have wildly different rates.
> 
> But that subtracts a from a and b from b, it doesn't mix a and b.
> 
> Note that se->vruntime - cfs_rq->min_vruntime is a very poor
> approximation of lag. We have actual lag now.
> 
> Note that:
> 
>   (sea - seb) + (min_fib - min_fia) =
>   (sea - min_fia) + (min_fib - seb) =
>   (sea - min_fia) - (seb - min_fib) =
>   'lag'a - 'lag'b
> 
> It doesn't mix absolute a and b terms anywhere.
> 
>> Actually, cfs_rq->core->core_min_vruntime does the same thing as
>> cfs_rq->min_vruntime_fi, providing a baseline, but
>> cfs_rq->core->core_min_vruntime is more accurate.
> 
> min(cfs_rqa, cfs_rqb) is nonsense. And I can't see how min_vruntime_fi
> would do anything like that.
> 
>> I've tried to implement a fair enough mechanism of core_vruntime, but
>> it's too complex because of the weight, and it costs a lot. So this is a
>> compromise solution.
> 
> 'this' is complete nonsense and not motivated by any math.
> 
>> BTW, is there any other solutions to solve this problem?
> 
> Well, this is where it all started:
> 
>   https://lkml.kernel.org/r/20200506143506.GH5298%40hirez.programming.kicks-ass.net
> 
> The above lag thing is detailed in a follow up:
> 
>   https://lkml.kernel.org/r/20200515103844.GG2978%40hirez.programming.kicks-ass.net

Many thanks, I'll study the discussion about this.

> 
> Anyway, I think the first of those links has the start of the
> multi-queue formalism, see the S_k+l bits. Work that out and see where
> it ends.
> 
> I did go a bit further, but I've forgotten everything since, it's been
> years.
> 
> Anyway, nothing like this goes in without a fairly solid bit of math in
> the changelog to justify it.
> 
> Also, I think Joel complained about something like this at some point,
> and he wanted to update the core tree more often, because IIRc his
> observation was that things got stale or something.

Many thanks for reviewing. I'll think about this more comprehensively.

Best,
Cruz Zhao
  
cruzzhao Nov. 17, 2023, 2:48 a.m. UTC | #6
On 2023/11/15 11:22 PM, Peter Zijlstra wrote:
> On Wed, Nov 15, 2023 at 09:42:13PM +0800, cruzzhao wrote:
>>
>>
>> On 2023/11/15 8:20 PM, Peter Zijlstra wrote:
>>> On Wed, Nov 15, 2023 at 07:33:40PM +0800, Cruz Zhao wrote:
>>>> To compare the priority of sched_entity from different cpus of a core,
>>>> we introduce core_vruntime to struct sched_entity and core_min_vruntime
>>>> to struct cfs_rq.
>>>>
>>>> cfs_rq->core->core_min_vruntime records the min vruntime of the cfs_rqs
>>>> of the same task_group among the core, and se->core_vruntime is the
>>>> vruntime relative to se->cfs_rq->core->core_min_vruntime.
>>>
>>> But that makes absolutely no sense. vruntime of different RQs can
>>> advance at wildly different rates. Not to mention there's this random
>>> offset between them.
>>>
>>> No, this cannot be.
>>
>> Force idle vruntime snapshot does the same thing, comparing
>> sea->vruntime - cfs_rqa->min_vruntime_fi with seb->vruntime -
>> cfs_rqb->min_vruntime_fi, while sea and seb may have wildly different rates.
> 
> But that subtracts a from a and b from b, it doesn't mix a and b.
> 
> Note that se->vruntime - cfs_rq->min_vruntime is a very poor
> approximation of lag. We have actual lag now.
> 
> Note that:
> 
>   (sea - seb) + (min_fib - min_fia) =
>   (sea - min_fia) + (min_fib - seb) =
>   (sea - min_fia) - (seb - min_fib) =
>   'lag'a - 'lag'b
> 
> It doesn't mix absolute a and b terms anywhere.
> 
>> Actually, cfs_rq->core->core_min_vruntime does the same thing as
>> cfs_rq->min_vruntime_fi, providing a baseline, but
>> cfs_rq->core->core_min_vruntime is more accurate.
> 
> min(cfs_rqa, cfs_rqb) is nonsense. And I can't see how min_vruntime_fi
> would do anything like that.
> 

Introducing core_vruntime and core_min_vruntime is an attempt to
maintain a single core-wide cfs_rq by abstracting vruntime, and
core_min_vruntime is not equal to min(cfs_rqa, cfs_rqb).

Note that:

  sea->core_vruntime - seb->core_vruntime =
  sea->core_vruntime - seb->core_vruntime + core_min_vruntime - core_min_vruntime =
  (sea->core_vruntime - core_min_vruntime) - (seb->core_vruntime - core_min_vruntime) =
  'lag'a - 'lag'b

The problem of wildly different vruntime rates also affects the
vruntime snapshot. Consider a core that always force-idles one SMT
sibling: min_vruntime_fi is never updated, so 'lag'a and 'lag'b grow
according to their respective weights within each cfs_rq, rather than
the core-wide weights.

AFAIC, there is no perfect mechanism to solve this problem yet, but
I'll keep trying.

Best,
Cruz Zhao
  

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 292c31697248..df481a8ebc07 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -562,6 +562,9 @@  struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
+#ifdef CONFIG_SCHED_CORE
+	u64				core_vruntime;
+#endif
 	s64				vlag;
 	u64				slice;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61cbaa3cc385..60b2fd437474 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -750,30 +750,58 @@  static u64 __update_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
 	return min_vruntime;
 }
 
+#ifdef CONFIG_SCHED_CORE
+static u64 __update_core_min_vruntime(struct cfs_rq *cfs_rq, u64 vruntime)
+{
+	u64 min_vruntime = cfs_rq->core_min_vruntime;
+	s64 delta = (s64)(vruntime - min_vruntime);
+
+	return delta > 0 ? vruntime : min_vruntime;
+}
+#endif
+
 static void update_min_vruntime(struct cfs_rq *cfs_rq)
 {
 	struct sched_entity *se = __pick_first_entity(cfs_rq);
 	struct sched_entity *curr = cfs_rq->curr;
 
 	u64 vruntime = cfs_rq->min_vruntime;
+#ifdef CONFIG_SCHED_CORE
+	u64 core_vruntime = cfs_rq->core->min_vruntime;
+#endif
 
 	if (curr) {
-		if (curr->on_rq)
+		if (curr->on_rq) {
 			vruntime = curr->vruntime;
-		else
+#ifdef CONFIG_SCHED_CORE
+			core_vruntime = curr->core_vruntime;
+#endif
+		} else {
 			curr = NULL;
+		}
 	}
 
 	if (se) {
-		if (!curr)
+		if (!curr) {
 			vruntime = se->vruntime;
-		else
+#ifdef CONFIG_SCHED_CORE
+			core_vruntime = se->core_vruntime;
+#endif
+		} else {
 			vruntime = min_vruntime(vruntime, se->vruntime);
+#ifdef CONFIG_SCHED_CORE
+			core_vruntime = min_vruntime(core_vruntime, se->core_vruntime);
+#endif
+		}
 	}
 
 	/* ensure we never gain time by being placed backwards. */
 	u64_u32_store(cfs_rq->min_vruntime,
 		      __update_min_vruntime(cfs_rq, vruntime));
+#ifdef CONFIG_SCHED_CORE
+	u64_u32_store(cfs_rq->core->core_min_vruntime,
+		     __update_core_min_vruntime(cfs_rq->core, vruntime));
+#endif
 }
 
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
@@ -1137,6 +1165,7 @@  static void update_curr(struct cfs_rq *cfs_rq)
 	struct sched_entity *curr = cfs_rq->curr;
 	u64 now = rq_clock_task(rq_of(cfs_rq));
 	u64 delta_exec;
+	u64 delta_exec_fair;
 
 	if (unlikely(!curr))
 		return;
@@ -1158,7 +1187,11 @@  static void update_curr(struct cfs_rq *cfs_rq)
 	curr->sum_exec_runtime += delta_exec;
 	schedstat_add(cfs_rq->exec_clock, delta_exec);
 
-	curr->vruntime += calc_delta_fair(delta_exec, curr);
+	delta_exec_fair = calc_delta_fair(delta_exec, curr);
+	curr->vruntime += delta_exec_fair;
+#ifdef CONFIG_SCHED_CORE
+	curr->core_vruntime += delta_exec_fair;
+#endif
 	update_deadline(cfs_rq, curr);
 	update_min_vruntime(cfs_rq);
 
@@ -5009,6 +5042,9 @@  static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
 	u64 vslice, vruntime = avg_vruntime(cfs_rq);
+#ifdef CONFIG_SCHED_CORE
+	u64 core_vruntime = cfs_rq->core->core_min_vruntime + vruntime - cfs_rq->min_vruntime;
+#endif
 	s64 lag = 0;
 
 	se->slice = sysctl_sched_base_slice;
@@ -5091,6 +5127,9 @@  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	}
 
 	se->vruntime = vruntime - lag;
+#ifdef CONFIG_SCHED_CORE
+	se->core_vruntime = core_vruntime - lag;
+#endif
 
 	/*
 	 * When joining the competition; the exisiting tasks will be,
@@ -12655,6 +12694,9 @@  void init_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->tasks_timeline = RB_ROOT_CACHED;
 	u64_u32_store(cfs_rq->min_vruntime, (u64)(-(1LL << 20)));
+#ifdef CONFIG_SCHED_CORE
+	u64_u32_store(cfs_rq->core_min_vruntime, (u64)(-(1LL << 20)));
+#endif
 #ifdef CONFIG_SMP
 	raw_spin_lock_init(&cfs_rq->removed.lock);
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fca54223a1..f9d3701481f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -545,6 +545,7 @@  struct cfs_rq {
 	u64			exec_clock;
 	u64			min_vruntime;
 #ifdef CONFIG_SCHED_CORE
+	u64			core_min_vruntime;
 	unsigned int		forceidle_seq;
 	u64			min_vruntime_fi;
 	struct cfs_rq		*core;