[0/7] workqueue: Share the same PWQ for the CPUs of a pod and distribute max_active across pods

Message ID 20231227145143.2399-1-jiangshanlai@gmail.com
Headers
Series workqueue: Share the same PWQ for the CPUs of a pod and distribute max_active across pods |

Message

Lai Jiangshan Dec. 27, 2023, 2:51 p.m. UTC
  From: Lai Jiangshan <jiangshan.ljs@antgroup.com>

A different approach to fixing the misbehavior that can easily be exposed,
as reported in
http://lkml.kernel.org/r/dbu6wiwu3sdhmhikb2w6lns7b27gbobfavhjj57kwi2quafgwl@htjcc5oikcr3.

Lai Jiangshan (6):
  workqueue: Reuse the default PWQ as much as possible
  workqueue: Share the same PWQ for the CPUs of a pod
  workqueue: Add pwq_calculate_max_active()
  workqueue: Wrap common code into wq_adjust_pwqs_max_active()
  workqueue: Adjust pwq's max_active when CPU online/offline
  workqueue: Rename wq->saved_max_active to wq->max_active

Tejun Heo (1):
  workqueue: Implement system-wide max_active enforcement for unbound
    workqueues

 include/linux/workqueue.h |  34 +++++-
 kernel/workqueue.c        | 217 ++++++++++++++++++++++----------------
 2 files changed, 157 insertions(+), 94 deletions(-)
  

Comments

Tejun Heo Dec. 27, 2023, 11:06 p.m. UTC | #1
Hello, Lai.

On Wed, Dec 27, 2023 at 10:51:42PM +0800, Lai Jiangshan wrote:
>  static int pwq_calculate_max_active(struct pool_workqueue *pwq)
>  {
> +	int pwq_nr_online_cpus;
> +	int max_active;
> +
>  	/*
>  	 * During [un]freezing, the caller is responsible for ensuring
>  	 * that pwq_adjust_max_active() is called at least once after
> @@ -4152,7 +4158,18 @@ static int pwq_calculate_max_active(struct pool_workqueue *pwq)
>  	if ((pwq->wq->flags & WQ_FREEZABLE) && workqueue_freezing)
>  		return 0;
>  
> -	return pwq->wq->saved_max_active;
> +	if (!(pwq->wq->flags & WQ_UNBOUND))
> +		return pwq->wq->saved_max_active;
> +
> +	pwq_nr_online_cpus = cpumask_weight_and(pwq->pool->attrs->__pod_cpumask, cpu_online_mask);
> +	max_active = DIV_ROUND_UP(pwq->wq->saved_max_active * pwq_nr_online_cpus, num_online_cpus());

So, the problem with this approach is that we can end up segmenting
max_active to too many too small pieces. Imagine a system with an AMD EPYC
9754 - 256 threads spread across 16 L3 caches. Let's say there's a workqueue
used for IO (e.g. encryption) with the default CACHE affinity_scope and a
max_active of 2 * nr_cpus, which isn't uncommon for this type of workqueue.

The above code would limit each L3 domain to 32 concurrent work items. Let's
say a thread which is pinned to a CPU is issuing a lot of concurrent writes
with the expectation of being able to saturate all the CPUs. It won't be
able to even get close. The expected behavior is saturating all 256 CPUs on
the system; the resulting behavior would be saturating an eighth of them.

The crux of the problem is that the desired worker pool domain and
max_active enforcement domain don't match. We want to be fine grained with
the former but pretty close to the whole system for the latter.

Thanks.