Message ID | 20221102035301.512892-1-zhangsong34@huawei.com |
---|---|
State | New |
Series | sched/fair: Introduce priority load balance for CFS |
Commit Message
Song Zhang
Nov. 2, 2022, 3:53 a.m. UTC
Add a new sysctl interface:
/proc/sys/kernel/sched_prio_load_balance_enabled
0: default behavior
1: enable priority load balance for CFS
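Assuming a kernel built with the CONFIG_SCHED_PRIO_LB option this patch adds, the knob can then be toggled at runtime:

$ echo 1 > /proc/sys/kernel/sched_prio_load_balance_enabled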
For co-location of idle and non-idle tasks, when CFS does load
balancing, it is reasonable to prefer migrating non-idle tasks first
and to migrate idle tasks last. This reduces the interference caused by
SCHED_IDLE tasks as much as possible.
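The mechanism is a second per-rq list, cfs_idle_tasks, which holds the
SCHED_IDLE tasks; the balancer drains the ordinary cfs_tasks list first
and falls back to the idle list only if an imbalance remains. The
program below is a runnable userspace analogue of that two-pass order
(names and load values are illustrative, not kernel code; the real
implementation is in the diff quoted in the comments below):

/* two_list_demo.c: userspace analogue of the two-pass detach order.
 * Build: gcc two_list_demo.c -o two_list_demo
 */
#include <stdio.h>

struct task { const char *name; int load; };

/* Walk one list and "migrate" tasks until the imbalance is covered. */
static void detach_from(const struct task *list, int n, int *imbalance)
{
	for (int i = 0; i < n && *imbalance > 0; i++) {
		printf("migrate %s (load %d)\n", list[i].name, list[i].load);
		*imbalance -= list[i].load;
	}
}

int main(void)
{
	const struct task cfs_tasks[]      = { { "worker-a", 40 }, { "worker-b", 30 } };
	const struct task cfs_idle_tasks[] = { { "bg-scan", 5 }, { "bg-gc", 5 } };
	int imbalance = 75;

	detach_from(cfs_tasks, 2, &imbalance);          /* pass 1: non-idle first */
	if (imbalance > 0)                              /* pass 2: idle tasks last */
		detach_from(cfs_idle_tasks, 2, &imbalance);
	return 0;
}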
Testcase:
- Spawn a large number of idle (SCHED_IDLE) tasks to occupy the CPUs
  (one way to generate this load is sketched after the schbench command
  below).
- Let non-idle tasks compete with the idle tasks for CPU time.
Use schbench to measure the latency of the non-idle tasks:
$ ./schbench -m 1 -t 10 -r 30 -R 200
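The helper below is an illustrative sketch of how the SCHED_IDLE load
could be generated; it is not part of the patch (the author's own
busy-loop variant appears later in the discussion thread):

/* spawn_idle.c: fork N busy-loop children running under SCHED_IDLE.
 * Build and run: gcc spawn_idle.c -o spawn_idle && ./spawn_idle 5000
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int n = (argc > 1) ? atoi(argv[1]) : 5000;
	struct sched_param sp = { .sched_priority = 0 };

	for (int i = 0; i < n; i++) {
		if (fork() == 0) {
			/* SCHED_IDLE: runs only when the CPU has nothing better to do */
			if (sched_setscheduler(0, SCHED_IDLE, &sp) == -1)
				perror("sched_setscheduler");
			for (;;) {
				usleep(500);
				for (volatile int j = 0; j < 1000000; j++)
					; /* burn CPU */
			}
		}
	}
	pause(); /* keep the parent alive; kill the process group to stop */
	return 0;
}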
Test result:
1. Default behavior
Latency percentiles (usec) runtime 30 (s) (4562 total samples)
50.0th: 62528 (2281 samples)
75.0th: 623616 (1141 samples)
90.0th: 764928 (687 samples)
95.0th: 824320 (225 samples)
*99.0th: 920576 (183 samples)
99.5th: 953344 (23 samples)
99.9th: 1008640 (18 samples)
min=9, max=1074466
2. Enable priority load balance
Latency percentiles (usec) runtime 30 (s) (4391 total samples)
50.0th: 22624 (2204 samples)
75.0th: 48832 (1092 samples)
90.0th: 85376 (657 samples)
95.0th: 113280 (220 samples)
*99.0th: 182528 (175 samples)
99.5th: 206592 (22 samples)
99.9th: 290304 (17 samples)
min=6, max=351815
From the percentile details, we can see the benefit of priority load
balancing: 95% of non-idle task latencies stay within 113ms, whereas
with priority load balancing disabled almost 50% of non-idle task
latencies exceed 600ms.
Signed-off-by: Song Zhang <zhangsong34@huawei.com>
---
 include/linux/sched/sysctl.h |  4 +++
 init/Kconfig                 | 10 ++++++
 kernel/sched/core.c          |  3 ++
 kernel/sched/fair.c          | 61 +++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h         |  3 ++
 kernel/sysctl.c              | 11 +++++++
6 files changed, 91 insertions(+), 1 deletion(-)
Comments
On Wed, 2 Nov 2022 at 04:54, Song Zhang <zhangsong34@huawei.com> wrote:

This really looks like a v3 of
https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@huawei.com/
Please keep versioning.

> Add a new sysctl interface:
> /proc/sys/kernel/sched_prio_load_balance_enabled

We don't want to add more sysctl knobs for the scheduler, we even
removed some. A knob usually means that you want to fix your use case
but the solution doesn't make sense for all cases.

> 0: default behavior
> 1: enable priority load balance for CFS
>
> For co-location of idle and non-idle tasks, when CFS does load
> balancing, it is reasonable to prefer migrating non-idle tasks first
> and to migrate idle tasks last. This reduces the interference caused
> by SCHED_IDLE tasks as much as possible.

I don't agree that it's always the best choice to migrate a non-idle
task first.

CPU0 has 1 non-idle task and CPU1 has 1 non-idle task plus hundreds of
idle tasks, and there is an imbalance between the 2 CPUs: migrating the
non-idle task from CPU1 to CPU0 is not the best choice.

> Testcase:
> - Spawn a large number of idle (SCHED_IDLE) tasks to occupy the CPUs

What do you mean by a large number?

> - Let non-idle tasks compete with the idle tasks for CPU time.
>
> Use schbench to measure the latency of the non-idle tasks:
> $ ./schbench -m 1 -t 10 -r 30 -R 200

How many CPUs do you have?

> Test result:
> 1. Default behavior
> Latency percentiles (usec) runtime 30 (s) (4562 total samples)
>         50.0th: 62528 (2281 samples)
>         75.0th: 623616 (1141 samples)
>         90.0th: 764928 (687 samples)
>         95.0th: 824320 (225 samples)
>         *99.0th: 920576 (183 samples)
>         99.5th: 953344 (23 samples)
>         99.9th: 1008640 (18 samples)
>         min=9, max=1074466
>
> 2. Enable priority load balance
> Latency percentiles (usec) runtime 30 (s) (4391 total samples)
>         50.0th: 22624 (2204 samples)
>         75.0th: 48832 (1092 samples)
>         90.0th: 85376 (657 samples)
>         95.0th: 113280 (220 samples)
>         *99.0th: 182528 (175 samples)
>         99.5th: 206592 (22 samples)
>         99.9th: 290304 (17 samples)
>         min=6, max=351815
>
> From the percentile details, we can see the benefit of priority load
> balancing: 95% of non-idle task latencies stay within 113ms,

But even 113ms seems quite a large number if there is anything else
but 10 schbench workers and a bunch of idle threads that are running.

> whereas with priority load balancing disabled almost 50% of non-idle
> task latencies exceed 600ms.

Also, have you considered enabling the sched_feature LB_MIN?

> Signed-off-by: Song Zhang <zhangsong34@huawei.com>
> ---
>  include/linux/sched/sysctl.h |  4 +++
>  init/Kconfig                 | 10 ++++++
>  kernel/sched/core.c          |  3 ++
>  kernel/sched/fair.c          | 61 +++++++++++++++++++++++++++++++++++-
>  kernel/sched/sched.h         |  3 ++
>  kernel/sysctl.c              | 11 +++++++
>  6 files changed, 91 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index 303ee7dd0c7e..9b3673269ecc 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -32,6 +32,10 @@ extern unsigned int sysctl_numa_balancing_promote_rate_limit;
>  #define sysctl_numa_balancing_mode 0
>  #endif
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +extern unsigned int sysctl_sched_prio_load_balance_enabled;
> +#endif
> +
>  int sysctl_numa_balancing(struct ctl_table *table, int write, void *buffer,
>  		size_t *lenp, loff_t *ppos);
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 694f7c160c9c..b0dfe6701218 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1026,6 +1026,16 @@ config CFS_BANDWIDTH
>  	  restriction.
>  	  See Documentation/scheduler/sched-bwc.rst for more information.
>
> +config SCHED_PRIO_LB
> +	bool "Priority load balance for CFS"
> +	depends on SMP
> +	default n
> +	help
> +	  This feature enable CFS priority load balance to reduce
> +	  non-idle tasks latency interferenced by SCHED_IDLE tasks.
> +	  It prefer migrating non-idle tasks firstly and
> +	  migrating SCHED_IDLE tasks lastly.
> +
>  config RT_GROUP_SCHED
>  	bool "Group scheduling for SCHED_RR/FIFO"
>  	depends on CGROUP_SCHED
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5800b0623ff3..9be35431fdd5 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9731,6 +9731,9 @@ void __init sched_init(void)
>  		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>
>  		INIT_LIST_HEAD(&rq->cfs_tasks);
> +#ifdef CONFIG_SCHED_PRIO_LB
> +		INIT_LIST_HEAD(&rq->cfs_idle_tasks);
> +#endif
>
>  		rq_attach_root(rq, &def_root_domain);
>  #ifdef CONFIG_NO_HZ_COMMON
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e4a0b8bd941c..bdeb04324f0c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -139,6 +139,10 @@ static int __init setup_sched_thermal_decay_shift(char *str)
>  }
>  __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +unsigned int sysctl_sched_prio_load_balance_enabled;
> +#endif
> +
>  #ifdef CONFIG_SMP
>  /*
>   * For asym packing, by default the lower numbered CPU has higher priority.
> @@ -3199,6 +3203,21 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
>
>  #endif /* CONFIG_NUMA_BALANCING */
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +static void
> +adjust_rq_cfs_tasks(
> +	void (*list_op)(struct list_head *, struct list_head *),
> +	struct rq *rq,
> +	struct sched_entity *se)
> +{
> +	if (sysctl_sched_prio_load_balance_enabled &&
> +	    task_has_idle_policy(task_of(se)))
> +		(*list_op)(&se->group_node, &rq->cfs_idle_tasks);
> +	else
> +		(*list_op)(&se->group_node, &rq->cfs_tasks);
> +}
> +#endif
> +
>  static void
>  account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> @@ -3208,7 +3227,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  		struct rq *rq = rq_of(cfs_rq);
>
>  		account_numa_enqueue(rq, task_of(se));
> +#ifdef CONFIG_SCHED_PRIO_LB
> +		adjust_rq_cfs_tasks(list_add, rq, se);
> +#else
>  		list_add(&se->group_node, &rq->cfs_tasks);
> +#endif
>  	}
>  #endif
>  	cfs_rq->nr_running++;
> @@ -7631,7 +7654,11 @@ done: __maybe_unused;
>  	 * the list, so our cfs_tasks list becomes MRU
>  	 * one.
>  	 */
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	adjust_rq_cfs_tasks(list_move, rq, &p->se);
> +#else
>  	list_move(&p->se.group_node, &rq->cfs_tasks);
> +#endif
>  #endif
>
>  	if (hrtick_enabled_fair(rq))
> @@ -8156,11 +8183,18 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
>  static struct task_struct *detach_one_task(struct lb_env *env)
>  {
>  	struct task_struct *p;
> +	struct list_head *tasks = &env->src_rq->cfs_tasks;
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	bool has_detach_idle_tasks = false;
> +#endif
>
>  	lockdep_assert_rq_held(env->src_rq);
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +again:
> +#endif
>  	list_for_each_entry_reverse(p,
> -			&env->src_rq->cfs_tasks, se.group_node) {
> +			tasks, se.group_node) {
>  		if (!can_migrate_task(p, env))
>  			continue;
>
> @@ -8175,6 +8209,13 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>  		schedstat_inc(env->sd->lb_gained[env->idle]);
>  		return p;
>  	}
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	if (sysctl_sched_prio_load_balance_enabled && !has_detach_idle_tasks) {
> +		has_detach_idle_tasks = true;
> +		tasks = &env->src_rq->cfs_idle_tasks;
> +		goto again;
> +	}
> +#endif
>  	return NULL;
>  }
>
> @@ -8190,6 +8231,9 @@ static int detach_tasks(struct lb_env *env)
>  	unsigned long util, load;
>  	struct task_struct *p;
>  	int detached = 0;
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	bool has_detach_idle_tasks = false;
> +#endif
>
>  	lockdep_assert_rq_held(env->src_rq);
>
> @@ -8205,6 +8249,9 @@ static int detach_tasks(struct lb_env *env)
>  	if (env->imbalance <= 0)
>  		return 0;
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +again:
> +#endif
>  	while (!list_empty(tasks)) {
>  		/*
>  		 * We don't want to steal all, otherwise we may be treated likewise,
> @@ -8310,6 +8357,14 @@ static int detach_tasks(struct lb_env *env)
>  		list_move(&p->se.group_node, tasks);
>  	}
>
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	if (sysctl_sched_prio_load_balance_enabled &&
> +	    !has_detach_idle_tasks && env->imbalance > 0) {
> +		has_detach_idle_tasks = true;
> +		tasks = &env->src_rq->cfs_idle_tasks;
> +		goto again;
> +	}
> +#endif
>  	/*
>  	 * Right now, this is one of only two places we collect this stat
>  	 * so we can safely collect detach_one_task() stats here rather
> @@ -11814,7 +11869,11 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  	 * Move the next running task to the front of the list, so our
>  	 * cfs_tasks list becomes MRU one.
>  	 */
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	adjust_rq_cfs_tasks(list_move, rq, se);
> +#else
>  	list_move(&se->group_node, &rq->cfs_tasks);
> +#endif
>  }
>  #endif
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 1644242ecd11..1b831c05ba30 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1053,6 +1053,9 @@ struct rq {
>  	int online;
>
>  	struct list_head cfs_tasks;
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	struct list_head cfs_idle_tasks;
> +#endif
>
>  	struct sched_avg avg_rt;
>  	struct sched_avg avg_dl;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 188c305aeb8b..5fc0f9ffb675 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2090,6 +2090,17 @@ static struct ctl_table kern_table[] = {
>  		.extra1		= SYSCTL_ONE,
>  		.extra2		= SYSCTL_INT_MAX,
>  	},
> +#endif
> +#ifdef CONFIG_SCHED_PRIO_LB
> +	{
> +		.procname	= "sched_prio_load_balance_enabled",
> +		.data		= &sysctl_sched_prio_load_balance_enabled,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= SYSCTL_ZERO,
> +		.extra2		= SYSCTL_ONE,
> +	},
>  #endif
>  	{ }
>  };
> --
> 2.27.0
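For context on the LB_MIN suggestion above: LB_MIN is an existing
scheduler feature bit that makes detach_tasks() skip tasks contributing
almost no load until balancing has already failed on earlier attempts.
Roughly (paraphrased from kernel/sched/fair.c of that era, so treat as
approximate rather than an exact quote):

	/* with LB_MIN set, very low-load tasks are not migrated
	 * unless load balancing has already failed repeatedly */
	if (sched_feat(LB_MIN) &&
	    load < 16 && !env->sd->nr_balance_failed)
		goto next;

It can be toggled via /sys/kernel/debug/sched/features, as tried in the
reply below.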
Thanks for your reply!

On 2022/11/3 2:01, Vincent Guittot wrote:
> This really looks like a v3 of
> https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@huawei.com/
> Please keep versioning.
>
>> Add a new sysctl interface:
>> /proc/sys/kernel/sched_prio_load_balance_enabled
>
> We don't want to add more sysctl knobs for the scheduler, we even
> removed some. A knob usually means that you want to fix your use case
> but the solution doesn't make sense for all cases.

OK, I will remove this knob later.

>> For co-location of idle and non-idle tasks, when CFS does load
>> balancing, it is reasonable to prefer migrating non-idle tasks first
>> and to migrate idle tasks last. This reduces the interference caused
>> by SCHED_IDLE tasks as much as possible.
>
> I don't agree that it's always the best choice to migrate a non-idle
> task first.
>
> CPU0 has 1 non-idle task and CPU1 has 1 non-idle task plus hundreds of
> idle tasks, and there is an imbalance between the 2 CPUs: migrating the
> non-idle task from CPU1 to CPU0 is not the best choice.

If the non-idle task on CPU1 is running or cache hot, it cannot be
migrated, and idle tasks can also be migrated from CPU1 to CPU0. So I
think it does not matter.

>> Testcase:
>> - Spawn a large number of idle (SCHED_IDLE) tasks to occupy the CPUs
>
> What do you mean by a large number?
>
>> Use schbench to measure the latency of the non-idle tasks:
>> $ ./schbench -m 1 -t 10 -r 30 -R 200
>
> How many CPUs do you have?

Some details may not have been mentioned: my virtual machine has 8 CPUs
running one schbench process and 5000 idle tasks. Each idle task is a
busy-loop process like the one below:

$ cat idle_process.c
#include <unistd.h>

int main(void)
{
	volatile int i;

	while (1) {
		usleep(500);
		for (i = 0; i < 1000000; i++)
			; /* burn CPU */
	}
	return 0;
}

You can compile this, spawn 5000 copies as idle (SCHED_IDLE) tasks so
that they occupy the 8 CPUs, and then execute the schbench command to
test it.

> But even 113ms seems quite a large number if there is anything else
> but 10 schbench workers and a bunch of idle threads that are running.
>
> Also, have you considered enabling the sched_feature LB_MIN?

I have tried `echo LB_MIN > /sys/kernel/debug/sched/features`, but this
feature seems to make no difference.
On Thu, 3 Nov 2022 at 04:01, Song Zhang <zhangsong34@huawei.com> wrote:
> Thanks for your reply!
>
>> I don't agree that it's always the best choice to migrate a non-idle
>> task first.
>>
>> CPU0 has 1 non-idle task and CPU1 has 1 non-idle task plus hundreds of
>> idle tasks, and there is an imbalance between the 2 CPUs: migrating the
>> non-idle task from CPU1 to CPU0 is not the best choice.
>
> If the non-idle task on CPU1 is running or cache hot, it cannot be
> migrated, and idle tasks can also be migrated from CPU1 to CPU0. So I
> think it does not matter.

What I mean is that migrating non-idle tasks first is not a universal
win and not always what we want.

>> How many CPUs do you have?
>
> Some details may not have been mentioned: my virtual machine has 8 CPUs
> running one schbench process and 5000 idle tasks. Each idle task is a
> busy-loop process like the one below:

How can you care about latency when you start 10 workers on 8 vCPUs
with 5000 non-idle threads?
On 2022/11/3 16:33, Vincent Guittot wrote:
>>> I don't agree that it's always the best choice to migrate a non-idle
>>> task first.
>>>
>>> CPU0 has 1 non-idle task and CPU1 has 1 non-idle task plus hundreds
>>> of idle tasks, and there is an imbalance between the 2 CPUs:
>>> migrating the non-idle task from CPU1 to CPU0 is not the best choice.
>>
>> If the non-idle task on CPU1 is running or cache hot, it cannot be
>> migrated, and idle tasks can also be migrated from CPU1 to CPU0. So I
>> think it does not matter.
>
> What I mean is that migrating non-idle tasks first is not a universal
> win and not always what we want.

But migrating non-idle tasks first is mostly a trade-off: non-idle
(latency-sensitive) tasks can obtain more CPU time, and the interference
caused by idle tasks is minimized. I think this makes sense in most
cases; could you point out what else I need to consider?

Best regards.

> How can you care about latency when you start 10 workers on 8 vCPUs
> with 5000 non-idle threads?

No, I spawn 5000 idle (SCHED_IDLE) processes, not 5000 non-idle
threads, along with 10 non-idle schbench workers on 8 vCPUs.
On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@huawei.com> wrote:
>> How can you care about latency when you start 10 workers on 8 vCPUs
>> with 5000 non-idle threads?
>
> No, I spawn 5000 idle (SCHED_IDLE) processes, not 5000 non-idle
> threads, along with 10 non-idle schbench workers on 8 vCPUs.

Yes, spawn 5000 idle tasks, but my point remains the same.
> >>>> */ > >>>> +#ifdef CONFIG_SCHED_PRIO_LB > >>>> + adjust_rq_cfs_tasks(list_move, rq, se); > >>>> +#else > >>>> list_move(&se->group_node, &rq->cfs_tasks); > >>>> +#endif > >>>> } > >>>> #endif > >>>> > >>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > >>>> index 1644242ecd11..1b831c05ba30 100644 > >>>> --- a/kernel/sched/sched.h > >>>> +++ b/kernel/sched/sched.h > >>>> @@ -1053,6 +1053,9 @@ struct rq { > >>>> int online; > >>>> > >>>> struct list_head cfs_tasks; > >>>> +#ifdef CONFIG_SCHED_PRIO_LB > >>>> + struct list_head cfs_idle_tasks; > >>>> +#endif > >>>> > >>>> struct sched_avg avg_rt; > >>>> struct sched_avg avg_dl; > >>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c > >>>> index 188c305aeb8b..5fc0f9ffb675 100644 > >>>> --- a/kernel/sysctl.c > >>>> +++ b/kernel/sysctl.c > >>>> @@ -2090,6 +2090,17 @@ static struct ctl_table kern_table[] = { > >>>> .extra1 = SYSCTL_ONE, > >>>> .extra2 = SYSCTL_INT_MAX, > >>>> }, > >>>> +#endif > >>>> +#ifdef CONFIG_SCHED_PRIO_LB > >>>> + { > >>>> + .procname = "sched_prio_load_balance_enabled", > >>>> + .data = &sysctl_sched_prio_load_balance_enabled, > >>>> + .maxlen = sizeof(unsigned int), > >>>> + .mode = 0644, > >>>> + .proc_handler = proc_dointvec_minmax, > >>>> + .extra1 = SYSCTL_ZERO, > >>>> + .extra2 = SYSCTL_ONE, > >>>> + }, > >>>> #endif > >>>> { } > >>>> }; > >>>> -- > >>>> 2.27.0 > >>>> > >>> . > > .
On 2022/11/3 17:22, Vincent Guittot wrote:
> On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@huawei.com> wrote:

[...]

>> No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
>> threads, and with 10 non-idle schbench workers on 8 vCPUs.
>
> yes spawn 5000 idle tasks but my point remains the same
>

I have switched to a new virtual machine with a relatively powerful CPU
and retested. I bound all tasks to CPUs 0-5, set the schbench parameters
--thread=10 and --rps=100, and ran the latency test many times.
Here are the comparison results for reference:

$ taskset -c 0-5 ./schbench -m 1 -t 10 -r 30 -R 100

Latency percentiles (usec)     Base     Base + Patch
            50.0th:            8208      1734
            75.0th:           36928     14224
            90.0th:           49088     42944
            95.0th:           62272     48960 (-27.1%)
            *99.0th:          92032     82048
            99.5th:           97408     87680
            99.9th:          114048     98944

From the test results we can see that schbench p95 latency achieves
about a 27% reduction if we migrate non-idle tasks first.
On 2022/11/3 17:22, Vincent Guittot wrote:
> On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@huawei.com> wrote:

[...]

>> No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
>> threads, and with 10 non-idle schbench workers on 8 vCPUs.
>
> yes spawn 5000 idle tasks but my point remains the same
>

But I really don't understand what you are most focused on, or what
else I should do.
Hi, Vincent

On 2022/11/3 17:22, Vincent Guittot wrote:
> On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@huawei.com> wrote:

[...]

>> No no no... spawn 5000 idle(SCHED_IDLE) processes not 5000 non-idle
>> threads, and with 10 non-idle schbench workers on 8 vCPUs.
>
> yes spawn 5000 idle tasks but my point remains the same
>

I am so sorry that I have not received your reply for a long time, and
I am still waiting for it anxiously. In fact, migrating non-idle tasks
1st works well in most scenarios, so it may be possible to add a
sched_feat(LB_PRIO) to enable or disable that. Finally, I really hope
you can give me some better advice.

Best regards.

Song Zhang
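For reference, the sched_feat() approach floated here would sidestep the earlier objection to a sysctl, since scheduler features are debug-only toggles under /sys/kernel/debug/sched/features. A rough sketch: LB_PRIO is only the hypothetical name suggested in this message, and the helper mirrors adjust_rq_cfs_tasks() from the quoted patch, not any merged code.

/* kernel/sched/features.h: hypothetical feature bit, default off */
SCHED_FEAT(LB_PRIO, false)

/* kernel/sched/fair.c: the sysctl check in the patch's
 * adjust_rq_cfs_tasks() would then become a runtime-togglable
 * feature test.
 */
static void
adjust_rq_cfs_tasks(void (*list_op)(struct list_head *, struct list_head *),
                    struct rq *rq, struct sched_entity *se)
{
        if (sched_feat(LB_PRIO) && task_has_idle_policy(task_of(se)))
                (*list_op)(&se->group_node, &rq->cfs_idle_tasks);
        else
                (*list_op)(&se->group_node, &rq->cfs_tasks);
}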
On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@huawei.com> wrote:
>
> Hi, Vincent

[...]

> I am so sorry that I have not received your reply for a long time, and
> I am still waiting for it anxiously. In fact, migrating non-idle tasks
> 1st works well in most scenarios, so it may be possible to add a
> sched_feat(LB_PRIO) to enable or disable that. Finally, I really hope
> you can give me some better advice.

I have seen that you posted a v4 5 days ago, which is on my list to be
reviewed.

My concern here remains that selecting non-idle tasks 1st is not always
the best choice, for example when you have 1 non-idle task per CPU and
thousands of idle tasks moving around. Then, regarding your use case,
the weight of the 5000 idle threads is around twice the weight of your
non-idle bench: the summed weight of the idle threads is 15k whereas
the weight of your bench is around 6k, IIUC how RPS runs. This also
means that the idle threads will take a significant share of the system
time: 5000 / 7000 ticks. I don't understand how you can care about
latency in such an extreme case, and I'm interested to get the real use
case where you can have such a situation.

All that to say that an idle task remains a cfs task with a small but
not null weight, and we should not make them special other than by not
preempting at wakeup.

> Best regards.
>
> Song Zhang
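The tick arithmetic above can be sanity-checked against the CFS weights involved (a minimal userspace sketch; 1024 is the nice-0 weight, 3 is WEIGHT_IDLEPRIO, and the ~6k bench weight is the estimate from the message above):

#include <stdio.h>

int main(void)
{
        const double w_nice0 = 1024, w_idle = 3;  /* WEIGHT_IDLEPRIO */
        double bench = 6 * w_nice0;               /* ~6k, per the estimate */
        double idle  = 5000 * w_idle;             /* 15k for 5000 idle tasks */

        printf("idle  share: %.1f%%\n", 100 * idle / (idle + bench));
        printf("bench share: %.1f%%\n", 100 * bench / (idle + bench));
        return 0;
}

This prints an idle share of roughly 71%, matching the 5000/7000-ticks figure quoted above.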
On Mon, 14 Nov 2022 at 17:42, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@huawei.com> wrote:

[...]

> My concern here remains that selecting non-idle tasks 1st is not always
> the best choice, for example when you have 1 non-idle task per CPU and
> thousands of idle tasks moving around. Then, regarding your use case,
> the weight of the 5000 idle threads is around twice the weight of your
> non-idle bench: the summed weight of the idle threads is 15k whereas
> the weight of your bench is around 6k, IIUC how RPS runs. This also
> means that the idle threads will take a significant share of the system
> time: 5000 / 7000 ticks. I don't understand how you can care about
> latency in such an extreme case, and I'm interested to get the real use
> case where you can have such a situation.
>
> All that to say that an idle task remains a cfs task with a small but
> not null weight, and we should not make them special other than by not
> preempting at wakeup.

Also, as mentioned for a previous version, a task with nice prio 19 has
a weight of 15, so if you replace the 5k idle threads with 1k cfs
threads at nice prio 19, you will face a similar problem. So you can't
really rely only on the idle property of a task.
On 2022/11/15 15:18, Vincent Guittot wrote:
> On Mon, 14 Nov 2022 at 17:42, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:

[...]

>> All that to say that an idle task remains a cfs task with a small but
>> not null weight, and we should not make them special other than by not
>> preempting at wakeup.
>
> Also, as mentioned for a previous version, a task with nice prio 19 has
> a weight of 15, so if you replace the 5k idle threads with 1k cfs
> threads at nice prio 19, you will face a similar problem. So you can't
> really rely only on the idle property of a task.
>

Well, my original idea was to consider the interference between tasks
of different priorities when doing CFS load balancing, so that non-idle
tasks get more CPU time without changing the native CFS load balancing
policy.

Consider a simple scenario. Assume that CPU 0 has two non-idle tasks
whose weight is 1024 * 2 = 2048, and CPU 0 also has 1000 idle tasks
whose weight is 1K x 15 = 15K. CPU 1 is idle, so idle load balance is
triggered and CPU 1 needs to pull a certain number of tasks from CPU 0.
If we do not consider task priorities and the interference between
tasks, more than 600 idle tasks on CPU 0 may be migrated to CPU 1. As a
result, the two non-idle tasks still compete on CPU 0, while CPU 1 runs
only idle tasks and no non-idle task.

Let's calculate the percentage of CPU time gained by non-idle tasks in
a scheduling period:

CPU 1: time_percent(non-idle tasks) = 0
CPU 0: time_percent(non-idle tasks) = 2048 * 2 / (2048 + 15000) = 24%

On the other hand, if we consider the interference between different
task priorities, we change the migration policy to first migrate a
non-idle task from CPU 0 to CPU 1. Migrating idle tasks from CPU 0
might interfere with the non-idle task now on CPU 1, so we defer
migrating idle tasks from CPU 0 until the non-idle task on CPU 1 has
completed or exited. Now the percentage of CPU time obtained by the
non-idle tasks in a scheduling period is as follows:

CPU 1: time_percent(non-idle tasks) = 1024 / 1024 = 100%
CPU 0: time_percent(non-idle tasks) = 1024 / (1024 + 15000) = 6.4%

Obviously, if load balancing prefers to migrate non-idle tasks and
suppresses the interference of idle-task migration on non-idle tasks,
the latency of non-idle tasks can be significantly reduced. Although
this causes some imbalance of idle tasks between CPUs and reduces the
throughput of idle tasks, I think this strategy is feasible for
latency-sensitive tasks in some real-time business scenarios.
On Wed, 16 Nov 2022 at 08:37, Song Zhang <zhangsong34@huawei.com> wrote: > > > > On 2022/11/15 15:18, Vincent Guittot wrote: > > On Mon, 14 Nov 2022 at 17:42, Vincent Guittot > > <vincent.guittot@linaro.org> wrote: > >> > >> On Sat, 12 Nov 2022 at 03:51, Song Zhang <zhangsong34@huawei.com> wrote: > >>> > >>> Hi, Vincent > >>> > >>> On 2022/11/3 17:22, Vincent Guittot wrote: > >>>> On Thu, 3 Nov 2022 at 10:20, Song Zhang <zhangsong34@huawei.com> wrote: > >>>>> > >>>>> > >>>>> > >>>>> On 2022/11/3 16:33, Vincent Guittot wrote: > >>>>>> On Thu, 3 Nov 2022 at 04:01, Song Zhang <zhangsong34@huawei.com> wrote: > >>>>>>> > >>>>>>> Thanks for your reply! > >>>>>>> > >>>>>>> On 2022/11/3 2:01, Vincent Guittot wrote: > >>>>>>>> On Wed, 2 Nov 2022 at 04:54, Song Zhang <zhangsong34@huawei.com> wrote: > >>>>>>>>> > >>>>>>>> > >>>>>>>> This really looks like a v3 of > >>>>>>>> https://lore.kernel.org/all/20220810015636.3865248-1-zhangsong34@huawei.com/ > >>>>>>>> > >>>>>>>> Please keep versioning. > >>>>>>>> > >>>>>>>>> Add a new sysctl interface: > >>>>>>>>> /proc/sys/kernel/sched_prio_load_balance_enabled > >>>>>>>> > >>>>>>>> We don't want to add more sysctl knobs for the scheduler, we even > >>>>>>>> removed some. Knob usually means that you want to fix your use case > >>>>>>>> but the solution doesn't make sense for all cases. > >>>>>>>> > >>>>>>> > >>>>>>> OK, I will remove this knobs later. > >>>>>>> > >>>>>>>>> > >>>>>>>>> 0: default behavior > >>>>>>>>> 1: enable priority load balance for CFS > >>>>>>>>> > >>>>>>>>> For co-location with idle and non-idle tasks, when CFS do load balance, > >>>>>>>>> it is reasonable to prefer migrating non-idle tasks and migrating idle > >>>>>>>>> tasks lastly. This will reduce the interference by SCHED_IDLE tasks > >>>>>>>>> as much as possible. > >>>>>>>> > >>>>>>>> I don't agree that it's always the best choice to migrate a non-idle task 1st. > >>>>>>>> > >>>>>>>> CPU0 has 1 non idle task and CPU1 has 1 non idle task and hundreds of > >>>>>>>> idle task and there is an imbalance between the 2 CPUS: migrating the > >>>>>>>> non idle task from CPU1 to CPU0 is not the best choice > >>>>>>>> > >>>>>>> > >>>>>>> If the non idle task on CPU1 is running or cache hot, it cannot be > >>>>>>> migrated and idle tasks can also be migrated from CPU1 to CPU0. So I > >>>>>>> think it does not matter. > >>>>>> > >>>>>> What I mean is that migrating non idle tasks first is not a universal > >>>>>> win and not always what we want. > >>>>>> > >>>>> > >>>>> But migrating online tasks first is mostly a trade-off that > >>>>> non-idle(Latency Sensitive) tasks can obtain more CPU time and minimize > >>>>> the interference caused by IDLE tasks. I think this makes sense in most > >>>>> cases, or you can point out what else I need to think about it ? > >>>>> > >>>>> Best regards. > >>>>> > >>>>>>> > >>>>>>>>> > >>>>>>>>> Testcase: > >>>>>>>>> - Spawn large number of idle(SCHED_IDLE) tasks occupy CPUs > >>>>>>>> > >>>>>>>> What do you mean by a large number ? > >>>>>>>> > >>>>>>>>> - Let non-idle tasks compete with idle tasks for CPU time. > >>>>>>>>> > >>>>>>>>> Using schbench to test non-idle tasks latency: > >>>>>>>>> $ ./schbench -m 1 -t 10 -r 30 -R 200 > >>>>>>>> > >>>>>>>> How many CPUs do you have ? > >>>>>>>> > >>>>>>> > >>>>>>> OK, some details may not be mentioned. > >>>>>>> My virtual machine has 8 CPUs running with a schbench process and 5000 > >>>>>>> idle tasks. 
> > My concern here remains that selecting non idle task 1st is not
> > always the best choice, as for example when you have 1 non idle
> > task per cpu and thousands of idle tasks moving around. [...]
> >
> > All that to say that idle task remains cfs task with a small but
> > not null weight and we should not make them special other than by
> > not preempting at wakeup.
> >
> > Also, as mentioned for a previous version, a task with nice prio 19
> > has a weight of 15, so if you replace the 5k idle threads with 1k
> > cfs nice 19 threads you will face a similar problem. So you can't
> > really care only about the idle property of a task.
>
> Well, my original idea was to consider interference between tasks of
> different priorities when doing CFS load balancing, to ensure that
> non-idle tasks get more CPU time without changing the native CFS load
> balancing policy.
>
> Consider a simple scenario. Assume that CPU 0 has two non-idle tasks
> whose weight is 1024 * 2 = 2048, and CPU 0 also has 1000 idle tasks
> whose weight is 1K x 15 = 15K. CPU 1 is idle, so IDLE load balance is
> triggered and CPU 1 needs to pull a certain number of tasks from
> CPU 0. If we do not consider task priorities and interference between
> tasks, more than 600 idle tasks on CPU 0 may be migrated to CPU 1. As
> a result, the two non-idle tasks still compete on CPU 0, while CPU 1
> runs only idle tasks and no non-idle tasks.
>
> Let's calculate the percentage of CPU time gained by non-idle tasks
> in a scheduling period:
>
> CPU 1: time_percent(non-idle tasks) = 0
> CPU 0: time_percent(non-idle tasks) = 2048 * 2 / (2048 + 15000) = 24%

The weight of a cfs idle thread is 3; the weight of a cfs nice 19
thread is 15.

With 2 cfs nice 0 tasks and 1000 cfs idle tasks on 2 CPUs, the weight
of the system is:

    2 * 1024 + 1000 * 3 = 5048, or 2524 per CPU

This means that each cfs nice 0 task should get 1024/5048 = 20% of
system time, which means 40% of one CPU's time.
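The share arithmetic above is easy to check outside the kernel. Here is a
minimal userspace sketch (an editor's illustration, not kernel code; the
weights are the ones quoted in this thread: nice 0 = 1024, SCHED_IDLE = 3):

    #include <stdio.h>

    int main(void)
    {
            const int nice0_w = 1024, idle_w = 3;
            const int nr_nice0 = 2, nr_idle = 1000, nr_cpus = 2;
            int total = nr_nice0 * nice0_w + nr_idle * idle_w;
            double share = (double)nice0_w / total;

            printf("total weight: %d (%d per CPU)\n", total, total / nr_cpus);
            /* One nice 0 task's fair share of the system, then of one CPU. */
            printf("nice 0 task: %.1f%% of system time, %.1f%% of a CPU\n",
                   100.0 * share, 100.0 * share * nr_cpus);
            return 0;
    }

This prints a total weight of 5048 (2524 per CPU) and roughly the 20%/40%
split described above.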
This also means that the 2 cfs tasks on CPU0 are a valid configuration,
as they will both get their 40% of CPU time.

cfs idle threads have a weight small enough to be negligible compared
to "normal" threads, so they can't normally balance a system by
themselves; but by spawning 1000+ cfs idle threads, you make them not
negligible anymore. That's the root of your problem. A CPU with only
cfs idle tasks should be seen as unbalanced compared to other CPUs with
non-idle tasks, and this is what happens with a small/normal number of
cfs idle threads.

> On the other hand, if we consider the interference between different
> task priorities, we change the migration policy to first migrate a
> non-idle task from CPU 0 to CPU 1. Migrating idle tasks from CPU 0
> might interfere with the non-idle task on CPU 1, so we migrate idle
> tasks from CPU 0 only after the non-idle tasks on CPU 1 have completed
> or exited.
>
> Now the percentage of CPU time obtained by the non-idle tasks in a
> scheduling period is as follows:
>
> CPU 1: time_percent(non-idle tasks) = 1024 / 1024 = 100%
> CPU 0: time_percent(non-idle tasks) = 1024 / (1024 + 15000) = 6.4%

But this is unfair to the one cfs nice 0 thread and to all the cfs
idle threads.

> Obviously, if load balance migration prefers to migrate non-idle
> tasks and suppresses the interference of idle task migration on
> non-idle tasks, the latency of non-idle tasks can be significantly
> reduced. Although this will cause some idle task imbalance between
> different CPUs and reduce the throughput of idle tasks, I think this
> strategy is feasible for latency-sensitive tasks in some real-time
> business scenarios.

But an idle cfs task remains a cfs task, and we keep cfs fairness for
all threads.

Have you tried to:
- increase the nice priority of the non-idle cfs tasks, so that the sum
  of the weights of the idle tasks remains a small portion of the total
  weight?
- put your thousands of idle tasks in a cgroup and set cpu.idle for
  this cgroup? This should also ensure that the weight of idle threads
  remains negligible compared to others.

I have tried both setups on my local system, and I get 1 non-idle task
per CPU.

Regards,
Vincent
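The two setups Vincent suggests can be sketched in a few lines of C. This is
only an illustration under stated assumptions: a cgroup v2 hierarchy mounted
at /sys/fs/cgroup with the cpu controller enabled, and a group name ("bulk")
invented for the example.

    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Option 2: corral bulk/idle tasks in a cgroup whose cpu.idle is 1,
     * so their combined weight stays negligible. Run by each idle task. */
    static void move_self_to_idle_cgroup(void)
    {
            FILE *f;

            mkdir("/sys/fs/cgroup/bulk", 0755);

            f = fopen("/sys/fs/cgroup/bulk/cpu.idle", "w");
            if (f) {
                    fputs("1", f);
                    fclose(f);
            }

            /* Move the calling process into the group. */
            f = fopen("/sys/fs/cgroup/bulk/cgroup.procs", "w");
            if (f) {
                    fprintf(f, "%d\n", getpid());
                    fclose(f);
            }
    }

    int main(void)
    {
            /* Option 1: raise the weight of the latency-sensitive task
             * itself (needs CAP_SYS_NICE), so the idle tasks' summed
             * weight stays a small fraction of the total. */
            if (setpriority(PRIO_PROCESS, 0, -10))
                    perror("setpriority");

            /* Option 2 is the alternative: each idle task would call
             * move_self_to_idle_cgroup() instead. */
            (void)move_self_to_idle_cgroup;
            return 0;
    }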
On 2022/11/16 22:38, Vincent Guittot wrote:
> On Wed, 16 Nov 2022 at 08:37, Song Zhang <zhangsong34@huawei.com> wrote:
>>
>> [...]
>> Consider a simple scenario. Assume that CPU 0 has two non-idle tasks
>> whose weight is 1024 * 2 = 2048, and CPU 0 also has 1000 idle tasks
>> whose weight is 1K x 15 = 15K. CPU 1 is idle.
>
> The weight of a cfs idle thread is 3; the weight of a cfs nice 19
> thread is 15.

Yes, idle weight is 3, thanks for pointing that out.

> With 2 cfs nice 0 tasks and 1000 cfs idle tasks on 2 CPUs, the weight
> of the system is:
>
>     2 * 1024 + 1000 * 3 = 5048, or 2524 per CPU
>
> This means that each cfs nice 0 task should get 1024/5048 = 20% of
> system time, which means 40% of one CPU's time.
> This also means that the 2 cfs tasks on CPU0 are a valid
> configuration, as they will both get their 40% of CPU time.

If you increase the idle task count to 3000, each cfs nice 0 task only
gets 1024 / (2 * 1024 + 3000 * 3) = 9.3% of system time. But if we
first migrate one cfs nice 0 task to CPU 1, that task may finish
quickly on CPU 1; CPU 1 then goes idle again and pulls more idle tasks
from CPU 0, so the cfs nice 0 task left on CPU 0 can also complete more
quickly.

> cfs idle threads have a weight small enough to be negligible compared
> to "normal" threads, so they can't normally balance a system by
> themselves; but by spawning 1000+ cfs idle threads, you make them not
> negligible anymore. That's the root of your problem.

If we do not put all low-priority tasks into a cgroup with minimum cpu
shares, and instead only set the per-task scheduling policy to
SCHED_IDLE, then the weight of a large number of idle tasks cannot be
ignored.

>> CPU 1: time_percent(non-idle tasks) = 1024 / 1024 = 100%
>> CPU 0: time_percent(non-idle tasks) = 1024 / (1024 + 15000) = 6.4%
>
> But this is unfair to the one cfs nice 0 thread and to all the cfs
> idle threads.

This unfairness may be short-lived: as soon as CPU 1 goes idle again,
it immediately pulls more idle tasks from CPU 0, which accelerates the
non-idle tasks remaining on CPU 0.

> Have you tried to:
> - increase the nice priority of the non-idle cfs tasks, so that the
>   sum of the weights of the idle tasks remains a small portion of the
>   total weight?
> - put your thousands of idle tasks in a cgroup and set cpu.idle for
>   this cgroup? This should also ensure that the weight of idle
>   threads remains negligible compared to others.
>
> I have tried both setups on my local system, and I get 1 non-idle
> task per CPU.

Yes, I have tried both, and the results are as expected. But, as you
mentioned above, if all idle tasks are placed in a cgroup with minimum
cpu shares, or the nice priority of the non-idle tasks is increased,
then the weight of the idle tasks becomes negligible compared with that
of the non-idle tasks, and this does not affect the final result.
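To see how quickly plain SCHED_IDLE weights stop being negligible, the earlier
share calculation can be swept over the idle-task count (same illustrative
userspace sketch and assumed weights as before; at 3000 idle tasks this
reproduces the 9.3% figure above):

    #include <stdio.h>

    int main(void)
    {
            const int nice0_w = 1024, idle_w = 3, nr_nice0 = 2;

            for (int nr_idle = 0; nr_idle <= 5000; nr_idle += 1000) {
                    int total = nr_nice0 * nice0_w + nr_idle * idle_w;

                    /* Fair share of one nice 0 task over the whole system. */
                    printf("%4d idle tasks -> nice 0 share: %4.1f%%\n",
                           nr_idle, 100.0 * nice0_w / total);
            }
            return 0;
    }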
However, if we only consider changing the scheduling policy of the idle
tasks to SCHED_IDLE, and do not want to modify the nice priority of the
non-idle tasks, then the weight of the idle tasks and the interference
they cause to non-idle tasks need to be reconsidered when tasks migrate
between CPUs.

Best Regards,
Song Zhang
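The "while dead loop" SCHED_IDLE task used in the test is referenced earlier
in the thread, but its source is not preserved here; the following is only a
plausible reconstruction of such a task:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            struct sched_param sp = { .sched_priority = 0 };

            /* SCHED_IDLE: weight 3; normally runs only when nothing else
             * wants the CPU, which is why load balancing it is at issue. */
            if (sched_setscheduler(0, SCHED_IDLE, &sp))
                    perror("sched_setscheduler");

            for (;;)
                    ;       /* burn CPU forever */
    }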
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 303ee7dd0c7e..9b3673269ecc 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -32,6 +32,10 @@ extern unsigned int sysctl_numa_balancing_promote_rate_limit;
 #define sysctl_numa_balancing_mode 0
 #endif
 
+#ifdef CONFIG_SCHED_PRIO_LB
+extern unsigned int sysctl_sched_prio_load_balance_enabled;
+#endif
+
 int sysctl_numa_balancing(struct ctl_table *table, int write, void *buffer,
 		size_t *lenp, loff_t *ppos);
 
diff --git a/init/Kconfig b/init/Kconfig
index 694f7c160c9c..b0dfe6701218 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1026,6 +1026,16 @@ config CFS_BANDWIDTH
 	  restriction.
 	  See Documentation/scheduler/sched-bwc.rst for more information.
 
+config SCHED_PRIO_LB
+	bool "Priority load balance for CFS"
+	depends on SMP
+	default n
+	help
+	  This feature enables CFS priority load balancing to reduce
+	  the latency that SCHED_IDLE tasks impose on non-idle tasks.
+	  It prefers migrating non-idle tasks first and SCHED_IDLE
+	  tasks last.
+
 config RT_GROUP_SCHED
 	bool "Group scheduling for SCHED_RR/FIFO"
 	depends on CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5800b0623ff3..9be35431fdd5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9731,6 +9731,9 @@ void __init sched_init(void)
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_SCHED_PRIO_LB
+		INIT_LIST_HEAD(&rq->cfs_idle_tasks);
+#endif
 
 		rq_attach_root(rq, &def_root_domain);
 #ifdef CONFIG_NO_HZ_COMMON
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..bdeb04324f0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -139,6 +139,10 @@ static int __init setup_sched_thermal_decay_shift(char *str)
 }
 __setup("sched_thermal_decay_shift=", setup_sched_thermal_decay_shift);
 
+#ifdef CONFIG_SCHED_PRIO_LB
+unsigned int sysctl_sched_prio_load_balance_enabled;
+#endif
+
 #ifdef CONFIG_SMP
 /*
  * For asym packing, by default the lower numbered CPU has higher priority.
@@ -3199,6 +3203,21 @@ static inline void update_scan_period(struct task_struct *p, int new_cpu)
 
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_SCHED_PRIO_LB
+static void
+adjust_rq_cfs_tasks(
+	void (*list_op)(struct list_head *, struct list_head *),
+	struct rq *rq,
+	struct sched_entity *se)
+{
+	if (sysctl_sched_prio_load_balance_enabled &&
+	    task_has_idle_policy(task_of(se)))
+		(*list_op)(&se->group_node, &rq->cfs_idle_tasks);
+	else
+		(*list_op)(&se->group_node, &rq->cfs_tasks);
+}
+#endif
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -3208,7 +3227,11 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		struct rq *rq = rq_of(cfs_rq);
 
 		account_numa_enqueue(rq, task_of(se));
+#ifdef CONFIG_SCHED_PRIO_LB
+		adjust_rq_cfs_tasks(list_add, rq, se);
+#else
 		list_add(&se->group_node, &rq->cfs_tasks);
+#endif
 	}
 #endif
 	cfs_rq->nr_running++;
@@ -7631,7 +7654,11 @@ done: __maybe_unused;
 	 * the list, so our cfs_tasks list becomes MRU
 	 * one.
 	 */
+#ifdef CONFIG_SCHED_PRIO_LB
+	adjust_rq_cfs_tasks(list_move, rq, &p->se);
+#else
 	list_move(&p->se.group_node, &rq->cfs_tasks);
+#endif
 #endif
 
 	if (hrtick_enabled_fair(rq))
@@ -8156,11 +8183,18 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
 static struct task_struct *detach_one_task(struct lb_env *env)
 {
 	struct task_struct *p;
+	struct list_head *tasks = &env->src_rq->cfs_tasks;
+#ifdef CONFIG_SCHED_PRIO_LB
+	bool has_detach_idle_tasks = false;
+#endif
 
 	lockdep_assert_rq_held(env->src_rq);
 
+#ifdef CONFIG_SCHED_PRIO_LB
+again:
+#endif
 	list_for_each_entry_reverse(p,
-			&env->src_rq->cfs_tasks, se.group_node) {
+			tasks, se.group_node) {
 		if (!can_migrate_task(p, env))
 			continue;
 
@@ -8175,6 +8209,13 @@ static struct task_struct *detach_one_task(struct lb_env *env)
 		schedstat_inc(env->sd->lb_gained[env->idle]);
 		return p;
 	}
+#ifdef CONFIG_SCHED_PRIO_LB
+	if (sysctl_sched_prio_load_balance_enabled && !has_detach_idle_tasks) {
+		has_detach_idle_tasks = true;
+		tasks = &env->src_rq->cfs_idle_tasks;
+		goto again;
+	}
+#endif
 	return NULL;
 }
 
@@ -8190,6 +8231,9 @@ static int detach_tasks(struct lb_env *env)
 	unsigned long util, load;
 	struct task_struct *p;
 	int detached = 0;
+#ifdef CONFIG_SCHED_PRIO_LB
+	bool has_detach_idle_tasks = false;
+#endif
 
 	lockdep_assert_rq_held(env->src_rq);
 
@@ -8205,6 +8249,9 @@ static int detach_tasks(struct lb_env *env)
 	if (env->imbalance <= 0)
 		return 0;
 
+#ifdef CONFIG_SCHED_PRIO_LB
+again:
+#endif
 	while (!list_empty(tasks)) {
 		/*
 		 * We don't want to steal all, otherwise we may be treated likewise,
@@ -8310,6 +8357,14 @@ static int detach_tasks(struct lb_env *env)
 		list_move(&p->se.group_node, tasks);
 	}
 
+#ifdef CONFIG_SCHED_PRIO_LB
+	if (sysctl_sched_prio_load_balance_enabled &&
+	    !has_detach_idle_tasks && env->imbalance > 0) {
+		has_detach_idle_tasks = true;
+		tasks = &env->src_rq->cfs_idle_tasks;
+		goto again;
+	}
+#endif
 	/*
 	 * Right now, this is one of only two places we collect this stat
 	 * so we can safely collect detach_one_task() stats here rather
@@ -11814,7 +11869,11 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 	 * Move the next running task to the front of the list, so our
 	 * cfs_tasks list becomes MRU one.
 	 */
+#ifdef CONFIG_SCHED_PRIO_LB
+	adjust_rq_cfs_tasks(list_move, rq, se);
+#else
 	list_move(&se->group_node, &rq->cfs_tasks);
+#endif
 }
 #endif
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1644242ecd11..1b831c05ba30 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1053,6 +1053,9 @@ struct rq {
 	int			online;
 
 	struct list_head	cfs_tasks;
+#ifdef CONFIG_SCHED_PRIO_LB
+	struct list_head	cfs_idle_tasks;
+#endif
 
 	struct sched_avg	avg_rt;
 	struct sched_avg	avg_dl;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 188c305aeb8b..5fc0f9ffb675 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2090,6 +2090,17 @@ static struct ctl_table kern_table[] = {
 		.extra1		= SYSCTL_ONE,
 		.extra2		= SYSCTL_INT_MAX,
 	},
+#endif
+#ifdef CONFIG_SCHED_PRIO_LB
+	{
+		.procname	= "sched_prio_load_balance_enabled",
+		.data		= &sysctl_sched_prio_load_balance_enabled,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 #endif
 	{ }
 };
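With the patch applied and CONFIG_SCHED_PRIO_LB=y, the new policy is off by
default and is toggled through the sysctl the patch adds. A minimal sketch of
enabling it from C, equivalent to running, as root,
"echo 1 > /proc/sys/kernel/sched_prio_load_balance_enabled":

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/sched_prio_load_balance_enabled", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* 1: prefer migrating non-idle tasks first; 0: default behavior. */
            fputs("1", f);
            fclose(f);
            return 0;
    }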