From patchwork Tue Jan 30 18:33:36 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Waiman Long
X-Patchwork-Id: 194268
From: Waiman Long
To: Tejun Heo, Lai Jiangshan
Cc: linux-kernel@vger.kernel.org, Juri Lelli, Cestmir Kalina,
    Alex Gladkov, Waiman Long
Subject: [RFC PATCH 3/3] workqueue: Enable unbound cpumask update on ordered workqueues
Date: Tue, 30 Jan 2024 13:33:36 -0500
Message-Id: <20240130183336.511948-4-longman@redhat.com>
In-Reply-To: <20240130183336.511948-1-longman@redhat.com>
References: <20240130183336.511948-1-longman@redhat.com>

Ordered workqueues do not currently follow changes made to the global
unbound cpumask because per-pool_workqueue (pwq) changes may break the
ordering guarantee. As a result, a work function in an ordered workqueue
may end up running on a CPU isolated by cpuset.

This patch enables ordered workqueues to follow changes made to the
global unbound cpumask by temporarily saving incoming work items in an
internal queue until the old pwq has been properly flushed and is about
to be freed. At that point, any saved work items are queued on the new
pwq for execution. Ordered workqueues can thus follow unbound cpumask
changes like other unbound workqueues, at the expense of some delay in
the execution of work functions during the transition period.

Signed-off-by: Waiman Long
---
 kernel/workqueue.c | 169 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 156 insertions(+), 13 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 98c741eb43af..0ecbeecc74f2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -320,11 +320,30 @@ struct workqueue_struct {
 	 */
 	struct rcu_head		rcu;
 
+	/*
+	 * For orderly transition from old pwq to new pwq in ordered workqueues.
+	 *
+	 * During transition, queue_work() will queue the work items in a
+	 * temporary o_list. Once the old pwq has been properly flushed and is
+	 * about to be freed, the pending work items in o_list will be queued
+	 * to the new pwq to start execution.
+	 */
+	raw_spinlock_t	o_lock;		/* for protecting o_list & o_state */
+	atomic_t	o_nr_qw;	/* queue_work() in progress count */
+	int		o_state;	/* pwq transition state */
+	struct list_head o_list;	/* pending ordered work items */
+
 	/* hot fields used during command issue, aligned to cacheline */
 	unsigned int		flags ____cacheline_aligned; /* WQ: WQ_* flags */
 	struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
 };
 
+enum ordered_wq_states {
+	ORD_NORMAL	= 0,	/* default normal working state */
+	ORD_QUEUE,		/* queue works in o_list */
+	ORD_WAIT,		/* busy waiting */
+};
+
 static struct kmem_cache *pwq_cache;
 
 /*
@@ -1425,8 +1444,24 @@ static void get_pwq(struct pool_workqueue *pwq)
 static void put_pwq(struct pool_workqueue *pwq)
 {
 	lockdep_assert_held(&pwq->pool->lock);
+	lockdep_assert_irqs_disabled();
 	if (likely(--pwq->refcnt))
 		return;
+
+	/*
+	 * If a pwq transition is in progress for an ordered workqueue and
+	 * there is no pending work in wq->o_list, we can end this
+	 * transition period here.
+	 */
+	if (READ_ONCE(pwq->wq->o_state)) {
+		struct workqueue_struct *wq = pwq->wq;
+
+		raw_spin_lock(&wq->o_lock);
+		if (list_empty(&wq->o_list))
+			WRITE_ONCE(wq->o_state, ORD_NORMAL);
+		raw_spin_unlock(&wq->o_lock);
+	}
+
 	/*
 	 * @pwq can't be released under pool->lock, bounce to a dedicated
 	 * kthread_worker to avoid A-A deadlocks.
@@ -1795,6 +1830,8 @@ static void __queue_work_rcu_locked(int cpu, struct workqueue_struct *wq,
 static void __queue_work(int cpu, struct workqueue_struct *wq,
 			 struct work_struct *work)
 {
+	bool owq = wq->flags & __WQ_ORDERED_EXPLICIT;
+
 	/*
 	 * While a work item is PENDING && off queue, a task trying to
 	 * steal the PENDING will busy-loop waiting for it to either get
@@ -1813,7 +1850,35 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
 		return;
 
 	rcu_read_lock();
+	if (owq) {
+		/* Provide an acquire barrier */
+		atomic_inc_return_acquire(&wq->o_nr_qw);
+		for (;;) {
+			int ostate = READ_ONCE(wq->o_state);
+
+			if (!ostate)
+				break;
+			if (ostate == ORD_QUEUE) {
+				int new_ostate;
+
+				raw_spin_lock(&wq->o_lock);
+				new_ostate = READ_ONCE(wq->o_state);
+				if (unlikely(new_ostate != ostate)) {
+					raw_spin_unlock(&wq->o_lock);
+					continue;
+				}
+				list_add_tail(&work->entry, &wq->o_list);
+				raw_spin_unlock(&wq->o_lock);
+				goto unlock_out;
+			} else { /* ostate == ORD_WAIT */
+				cpu_relax();
+			}
+		}
+	}
 	__queue_work_rcu_locked(cpu, wq, work);
+unlock_out:
+	if (owq)
+		atomic_dec(&wq->o_nr_qw);
 	rcu_read_unlock();
 }
 
@@ -4107,6 +4172,57 @@ static void rcu_free_pwq(struct rcu_head *rcu)
 			container_of(rcu, struct pool_workqueue, rcu));
 }
 
+/* requeue the work items stored in wq->o_list */
+static void requeue_ordered_works(struct workqueue_struct *wq)
+{
+	LIST_HEAD(head);
+	struct work_struct *work, *next;
+
+	raw_spin_lock_irq(&wq->o_lock);
+	if (list_empty(&wq->o_list))
+		goto unlock_out;	/* No requeuing is needed */
+
+	list_splice_init(&wq->o_list, &head);
+	raw_spin_unlock_irq(&wq->o_lock);
+
+	/*
+	 * Requeue the first batch of work items. Since it may take a while
+	 * to drain the old pwq and update the workqueue attributes, there
+	 * may be a rather long list of work items to process. So we allow
+	 * queue_work() callers to continue putting their work items in o_list.
+	 */
+	list_for_each_entry_safe(work, next, &head, entry) {
+		list_del_init(&work->entry);
+		local_irq_disable();
+		__queue_work_rcu_locked(WORK_CPU_UNBOUND, wq, work);
+		local_irq_enable();
+	}
+
+	/*
+	 * Now check if there are more work items queued. If so, set ORD_WAIT
+	 * and force incoming queue_work() callers to busy wait until the 2nd
+	 * batch of work items has been properly requeued. It is assumed
+	 * that the 2nd batch should be much smaller.
+	 */
+	raw_spin_lock_irq(&wq->o_lock);
+	if (list_empty(&wq->o_list))
+		goto unlock_out;
+	WRITE_ONCE(wq->o_state, ORD_WAIT);
+	list_splice_init(&wq->o_list, &head);
+	raw_spin_unlock(&wq->o_lock);	/* Leave interrupts disabled */
+	list_for_each_entry_safe(work, next, &head, entry) {
+		list_del_init(&work->entry);
+		__queue_work_rcu_locked(WORK_CPU_UNBOUND, wq, work);
+	}
+	WRITE_ONCE(wq->o_state, ORD_NORMAL);
+	local_irq_enable();
+	return;
+
+unlock_out:
+	WRITE_ONCE(wq->o_state, ORD_NORMAL);
+	raw_spin_unlock_irq(&wq->o_lock);
+}
+
 /*
  * Scheduled on pwq_release_worker by put_pwq() when an unbound pwq hits zero
  * refcnt and needs to be destroyed.
@@ -4123,6 +4239,9 @@ static void pwq_release_workfn(struct kthread_work *work)
 	 * When @pwq is not linked, it doesn't hold any reference to the
 	 * @wq, and @wq is invalid to access.
 	 */
+	if (READ_ONCE(wq->o_state) && !WARN_ON_ONCE(list_empty(&pwq->pwqs_node)))
+		requeue_ordered_works(wq);
+
 	if (!list_empty(&pwq->pwqs_node)) {
 		mutex_lock(&wq->mutex);
 		list_del_rcu(&pwq->pwqs_node);
@@ -4389,6 +4508,17 @@ apply_wqattrs_prepare(struct workqueue_struct *wq,
 	cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask);
 	ctx->attrs = new_attrs;
 
+	/*
+	 * For initialized ordered workqueues, start the pwq transition
+	 * sequence by setting o_state to ORD_QUEUE and wait until there
+	 * is no outstanding queue_work() caller in progress.
+	 */
+	if (!list_empty(&wq->pwqs) && (wq->flags & __WQ_ORDERED_EXPLICIT)) {
+		smp_store_mb(wq->o_state, ORD_QUEUE);
+		while (atomic_read(&wq->o_nr_qw))
+			cpu_relax();
+	}
+
 	ctx->wq = wq;
 
 	return ctx;
@@ -4429,13 +4559,8 @@ static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
 	if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
 		return -EINVAL;
 
-	/* creating multiple pwqs breaks ordering guarantee */
-	if (!list_empty(&wq->pwqs)) {
-		if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
-			return -EINVAL;
-
+	if (!list_empty(&wq->pwqs) && !(wq->flags & __WQ_ORDERED_EXPLICIT))
 		wq->flags &= ~__WQ_ORDERED;
-	}
 
 	ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
 	if (IS_ERR(ctx))
@@ -4713,6 +4838,9 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 	INIT_LIST_HEAD(&wq->flusher_queue);
 	INIT_LIST_HEAD(&wq->flusher_overflow);
 	INIT_LIST_HEAD(&wq->maydays);
+	INIT_LIST_HEAD(&wq->o_list);
+	atomic_set(&wq->o_nr_qw, 0);
+	raw_spin_lock_init(&wq->o_lock);
 
 	wq_init_lockdep(wq);
 	INIT_LIST_HEAD(&wq->list);
@@ -5793,11 +5921,27 @@ static int workqueue_apply_unbound_cpumask(const cpumask_var_t unbound_cpumask)
 		if (!(wq->flags & WQ_UNBOUND) || (wq->flags & __WQ_DESTROYING))
 			continue;
 
-		/* creating multiple pwqs breaks ordering guarantee */
+		/*
+		 * We do not support changing the attrs of an ordered workqueue
+		 * again before the previous attrs change has completed.
+		 * Sleep up to 100ms in 10ms intervals to allow the previous
+		 * operation to complete and skip it if not done by then.
+		 */
 		if (!list_empty(&wq->pwqs)) {
-			if (wq->flags & __WQ_ORDERED_EXPLICIT)
-				continue;
-			wq->flags &= ~__WQ_ORDERED;
+			if (!(wq->flags & __WQ_ORDERED_EXPLICIT))
+				wq->flags &= ~__WQ_ORDERED;
+			else if (READ_ONCE(wq->o_state)) {
+				int i, ostate;
+
+				for (i = 0; i < 10; i++) {
+					msleep(10);
+					ostate = READ_ONCE(wq->o_state);
+					if (!ostate)
+						break;
+				}
+				if (WARN_ON_ONCE(ostate))
+					continue;
+			}
 		}
 
 		ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs, unbound_cpumask);
@@ -6313,9 +6457,8 @@ int workqueue_sysfs_register(struct workqueue_struct *wq)
 	int ret;
 
 	/*
-	 * Adjusting max_active or creating new pwqs by applying
-	 * attributes breaks ordering guarantee. Disallow exposing ordered
-	 * workqueues.
+	 * Adjusting max_active breaks ordering guarantee. Disallow exposing
+	 * ordered workqueues.
 	 */
 	if (WARN_ON(wq->flags & __WQ_ORDERED_EXPLICIT))
 		return -EINVAL;
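
The patch above boils down to a small state machine: while an ordered
workqueue's pwq is being replaced, queue_work() parks incoming work items
on wq->o_list under wq->o_lock, and once the old pwq is released the
parked items are re-submitted in their original order. The stand-alone C
sketch below models that queue-and-requeue idea in userspace. It is an
illustration only, not the kernel code: the names (struct ord_wq,
ord_queue_work(), ord_requeue()), the pthread mutex standing in for
wq->o_lock, and the single-threaded demo are assumptions made for the
example, and the ORD_WAIT busy-wait phase and the o_nr_qw in-progress
counter of the real patch are omitted.

/*
 * Illustrative userspace sketch of the queue-and-requeue idea -- not the
 * kernel implementation.  A pthread mutex stands in for wq->o_lock and a
 * singly linked FIFO stands in for wq->o_list.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

enum ord_state { ORD_NORMAL, ORD_QUEUE };	/* mirrors o_state, simplified */

struct work {
	void (*fn)(int);
	int arg;
	struct work *next;
};

struct ord_wq {
	pthread_mutex_t o_lock;		/* protects o_list, o_tail and o_state */
	enum ord_state o_state;		/* pwq-transition state */
	struct work *o_list;		/* parked work items, FIFO order */
	struct work **o_tail;		/* tail pointer for cheap appends */
};

/* Stand-in for __queue_work_rcu_locked(): just run the work item. */
static void dispatch(struct work *w)
{
	w->fn(w->arg);
	free(w);
}

/* Queue a work item; park it if a transition is in progress. */
static void ord_queue_work(struct ord_wq *wq, void (*fn)(int), int arg)
{
	struct work *w = malloc(sizeof(*w));

	if (!w)
		return;
	w->fn = fn;
	w->arg = arg;
	w->next = NULL;

	pthread_mutex_lock(&wq->o_lock);
	if (wq->o_state == ORD_QUEUE) {
		*wq->o_tail = w;		/* append to preserve ordering */
		wq->o_tail = &w->next;
		pthread_mutex_unlock(&wq->o_lock);
		return;
	}
	pthread_mutex_unlock(&wq->o_lock);
	dispatch(w);				/* normal, non-transition path */
}

/* End the transition: re-submit the parked work items in order. */
static void ord_requeue(struct ord_wq *wq)
{
	struct work *head, *next;

	pthread_mutex_lock(&wq->o_lock);
	head = wq->o_list;
	wq->o_list = NULL;
	wq->o_tail = &wq->o_list;
	wq->o_state = ORD_NORMAL;
	pthread_mutex_unlock(&wq->o_lock);

	for (; head; head = next) {
		next = head->next;
		dispatch(head);
	}
}

static void say(int n)
{
	printf("work %d\n", n);
}

int main(void)
{
	struct ord_wq wq = { .o_state = ORD_NORMAL };

	pthread_mutex_init(&wq.o_lock, NULL);
	wq.o_tail = &wq.o_list;

	ord_queue_work(&wq, say, 1);	/* runs immediately */

	/* Single-threaded demo, so this store needs no locking. */
	wq.o_state = ORD_QUEUE;		/* a "pwq transition" begins */
	ord_queue_work(&wq, say, 2);	/* parked on o_list */
	ord_queue_work(&wq, say, 3);	/* parked on o_list */

	ord_requeue(&wq);		/* prints "work 2" then "work 3" */
	return 0;
}

Build with "cc -pthread sketch.c". The demo runs work 1 immediately, parks
work 2 and 3 during the simulated transition, and re-submits them in order
afterwards. With concurrent submitters, re-enabling the normal path before
the backlog is fully drained could reorder work items, which is exactly
why the real patch adds the ORD_WAIT busy-wait phase.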